
TrishEdith/testDatasetTransfer

Hugging Face | Updated 2025-12-06 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/TrishEdith/testDatasetTransfer
Description:
---
language:
- en
paperswithcode_id: fever
annotations_creators:
- crowdsourced
language_creators:
- found
license:
- cc-by-sa-3.0
- gpl-3.0
multilinguality:
- monolingual
pretty_name: FEVER
size_categories:
- 100K<n<1M
source_datasets:
- extended|wikipedia
task_categories:
- text-classification
task_ids: []
tags:
- knowledge-verification
dataset_info:
- config_name: v1.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: train
    num_bytes: 29591412
    num_examples: 311431
  - name: labelled_dev
    num_bytes: 3643157
    num_examples: 37566
  - name: unlabelled_dev
    num_bytes: 1548965
    num_examples: 19998
  - name: unlabelled_test
    num_bytes: 1617002
    num_examples: 19998
  - name: paper_dev
    num_bytes: 1821489
    num_examples: 18999
  - name: paper_test
    num_bytes: 1821668
    num_examples: 18567
  download_size: 44853972
  dataset_size: 40043693
- config_name: v2.0
  features:
  - name: id
    dtype: int32
  - name: label
    dtype: string
  - name: claim
    dtype: string
  - name: evidence_annotation_id
    dtype: int32
  - name: evidence_id
    dtype: int32
  - name: evidence_wiki_url
    dtype: string
  - name: evidence_sentence_id
    dtype: int32
  splits:
  - name: validation
    num_bytes: 306243
    num_examples: 2384
  download_size: 392466
  dataset_size: 306243
- config_name: wiki_pages
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: lines
    dtype: string
  splits:
  - name: wikipedia_pages
    num_bytes: 7254115038
    num_examples: 5416537
  download_size: 1713485474
  dataset_size: 7254115038
---

# Dataset Card for "fever"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://fever.ai/](https://fever.ai/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources (Wikidata, Freebase, etc.) – we are therefore limited by our ability to transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot of recent research and media coverage: false information coming from unreliable sources.

The FEVER workshops are a venue for work in verifiable knowledge extraction and to stimulate progress in this direction.

- FEVER Dataset: FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment.
- FEVER 2.0 Adversarial Attacks Dataset: The FEVER 2.0 Dataset consists of 1174 claims created by the submissions of participants in the Breaker phase of the 2019 shared task. Participants (Breakers) were tasked with generating adversarial examples that induce classification errors for the existing systems. Breakers submitted a dataset of up to 1000 instances with an equal number of instances for each of the three classes (Supported, Refuted, NotEnoughInfo). Only novel claims (i.e. not contained in the original FEVER dataset) were considered as valid entries to the shared task. The submissions were then manually evaluated for Correctness (grammatical, appropriately labeled, and meeting the FEVER annotation guidelines requirements).

### Supported Tasks and Leaderboards

The task is verification of textual claims against textual sources.

Compared to textual entailment (TE)/natural language inference, the key difference is where the evidence comes from. In TE, the passage used to verify each claim is given, and in recent years it has typically consisted of a single sentence. In verification systems, the passage is retrieved from a large set of documents in order to form the evidence.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

#### v1.0

- **Size of downloaded dataset files:** 44.86 MB
- **Size of the generated dataset:** 40.05 MB
- **Total amount of disk used:** 84.89 MB

An example of 'train' looks as follows.
```
{'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
 'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
 'label': 'SUPPORTS',
 'id': 75397,
 'evidence_id': 104971,
 'evidence_sentence_id': 7,
 'evidence_annotation_id': 92206}
```

#### v2.0

- **Size of downloaded dataset files:** 0.39 MB
- **Size of the generated dataset:** 0.30 MB
- **Total amount of disk used:** 0.70 MB

An example of 'validation' looks as follows.

```
{'claim': "There is a convicted statutory rapist called Chinatown's writer.",
 'evidence_wiki_url': '',
 'label': 'NOT ENOUGH INFO',
 'id': 500000,
 'evidence_id': -1,
 'evidence_sentence_id': -1,
 'evidence_annotation_id': 269158}
```

#### wiki_pages

- **Size of downloaded dataset files:** 1.71 GB
- **Size of the generated dataset:** 7.25 GB
- **Total amount of disk used:** 8.97 GB

An example of 'wikipedia_pages' looks as follows.

```
{'text': 'The following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world . ',
 'lines': '0\tThe following are the football -LRB- soccer -RRB- events of the year 1928 throughout the world .\n1\t',
 'id': '1928_in_association_football'}
```

### Data Fields

The data fields are the same among all splits.

#### v1.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.

#### v2.0

- `id`: an `int32` feature.
- `label`: a `string` feature.
- `claim`: a `string` feature.
- `evidence_annotation_id`: an `int32` feature.
- `evidence_id`: an `int32` feature.
- `evidence_wiki_url`: a `string` feature.
- `evidence_sentence_id`: an `int32` feature.

#### wiki_pages

- `id`: a `string` feature.
- `text`: a `string` feature.
- `lines`: a `string` feature.

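The `lines` field in the `wiki_pages` example above appears to pack one sentence per row, with a sentence index, a tab, and the sentence text, and rows separated by newlines. The following is a minimal sketch of how a claim row might be joined to its evidence sentence under that reading. Several points are assumptions rather than documented behaviour: that `evidence_wiki_url` matches a `wiki_pages` `id`, that `evidence_sentence_id` indexes the rows of `lines`, and that `-1` marks missing evidence (as in the v2.0 example). The function names `parse_lines` and `evidence_sentence` and the claim row in the usage example are hypothetical.

```python
from typing import Dict, Optional


def parse_lines(lines: str) -> Dict[int, str]:
    """Map sentence index -> sentence text for one `wiki_pages` record.

    Assumes each row looks like "<index>\t<text>"; any further
    tab-separated columns are ignored.
    """
    sentences: Dict[int, str] = {}
    for row in lines.split("\n"):
        parts = row.split("\t")
        if not parts[0].isdigit():
            continue  # skip blank or malformed rows
        sentences[int(parts[0])] = parts[1] if len(parts) > 1 else ""
    return sentences


def evidence_sentence(claim_row: dict, page: dict) -> Optional[str]:
    """Return the evidence sentence for a claim row, or None if unavailable.

    Assumes -1 means "no evidence" and that `evidence_wiki_url`
    equals the page `id` (both are assumptions, see the lead-in).
    """
    sentence_id = claim_row["evidence_sentence_id"]
    if sentence_id < 0 or claim_row["evidence_wiki_url"] != page["id"]:
        return None
    # An empty row (such as the trailing "1\t") is treated as no sentence.
    return parse_lines(page["lines"]).get(sentence_id) or None


# The page record is the `wiki_pages` example from this card.
page = {
    "id": "1928_in_association_football",
    "lines": "0\tThe following are the football -LRB- soccer -RRB- events "
             "of the year 1928 throughout the world .\n1\t",
}
# Hypothetical claim row, invented only to exercise the lookup.
claim_row = {
    "evidence_wiki_url": "1928_in_association_football",
    "evidence_sentence_id": 0,
}
print(evidence_sentence(claim_row, page))
```

Parsing into a dictionary rather than a list keeps the lookup correct even if a page skips an index, at the cost of a little extra memory per page.
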
### Data Splits

#### v1.0

|      |  train | unlabelled_dev | labelled_dev | paper_dev | unlabelled_test | paper_test |
|------|-------:|---------------:|-------------:|----------:|----------------:|-----------:|
| v1.0 | 311431 |          19998 |        37566 |     18999 |           19998 |      18567 |

#### v2.0

|      | validation |
|------|-----------:|
| v2.0 |       2384 |

#### wiki_pages

|            | wikipedia_pages |
|------------|----------------:|
| wiki_pages |         5416537 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

FEVER license:

```
These data annotations incorporate material from Wikipedia, which is licensed pursuant to the Wikipedia Copyright Policy.
These annotations are made available under the license terms described on the applicable Wikipedia article pages, or, where Wikipedia license terms are unavailable, under the Creative Commons Attribution-ShareAlike License (version 3.0), available at http://creativecommons.org/licenses/by-sa/3.0/ (collectively, the “License Terms”).
You may not use these files except in compliance with the applicable License Terms.
```

### Citation Information

If you use "FEVER Dataset", please cite:

```bibtex
@inproceedings{Thorne18Fever,
    author = {Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {{FEVER}: a Large-scale Dataset for Fact Extraction and {VERification}},
    booktitle = {NAACL-HLT},
    year = {2018}
}
```

If you use "FEVER 2.0 Adversarial Attacks Dataset", please cite:

```bibtex
@inproceedings{Thorne19FEVER2,
    author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {The {FEVER2.0} Shared Task},
    booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
    year = {2019}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova) for adding this dataset.

Provider:
TrishEdith