allenai/scifact

Name: allenai/scifact
Creator: allenai
Published: 2023-12-21 10:19:34
License: CC BY-NC 2.0

Hugging Face, updated 2023-12-21, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/allenai/scifact

Resource description:

---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- cc-by-nc-2.0
multilinguality:
- monolingual
pretty_name: SciFact
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- fact-checking
paperswithcode_id: scifact
dataset_info:
- config_name: corpus
  features:
  - name: doc_id
    dtype: int32
  - name: title
    dtype: string
  - name: abstract
    sequence: string
  - name: structured
    dtype: bool
  splits:
  - name: train
    num_bytes: 7993572
    num_examples: 5183
  download_size: 3115079
  dataset_size: 7993572
- config_name: claims
  features:
  - name: id
    dtype: int32
  - name: claim
    dtype: string
  - name: evidence_doc_id
    dtype: string
  - name: evidence_label
    dtype: string
  - name: evidence_sentences
    sequence: int32
  - name: cited_doc_ids
    sequence: int32
  splits:
  - name: train
    num_bytes: 168627
    num_examples: 1261
  - name: test
    num_bytes: 33625
    num_examples: 300
  - name: validation
    num_bytes: 60360
    num_examples: 450
  download_size: 3115079
  dataset_size: 262612
---

# Dataset Card for "scifact"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://scifact.apps.allenai.org/](https://scifact.apps.allenai.org/)
- **Repository:** https://github.com/allenai/scifact
- **Paper:** [Fact or Fiction: Verifying Scientific Claims](https://aclanthology.org/2020.emnlp-main.609/)
- **Point of Contact:** [David Wadden](mailto:davidw@allenai.org)
- **Size of downloaded dataset files:** 6.23 MB
- **Size of the generated dataset:** 8.26 MB
- **Total amount of disk used:** 14.49 MB

### Dataset Summary

SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### claims

- **Size of downloaded dataset files:** 3.12 MB
- **Size of the generated dataset:** 262.61 kB
- **Total amount of disk used:** 3.38 MB

An example from the 'validation' split looks as follows.

```
{
    "cited_doc_ids": [14717500],
    "claim": "1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.",
    "evidence_doc_id": "14717500",
    "evidence_label": "SUPPORT",
    "evidence_sentences": [2, 5],
    "id": 3
}
```

#### corpus

- **Size of downloaded dataset files:** 3.12 MB
- **Size of the generated dataset:** 7.99 MB
- **Total amount of disk used:** 11.11 MB

An example from the 'train' split looks as follows.

```
This example was too long and was cropped:

{
    "abstract": "[\"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and res...",
    "doc_id": 4983,
    "structured": false,
    "title": "Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging."
}
```

### Data Fields

The data fields are the same among all splits.

#### claims

- `id`: an `int32` feature.
- `claim`: a `string` feature.
- `evidence_doc_id`: a `string` feature.
- `evidence_label`: a `string` feature.
- `evidence_sentences`: a `list` of `int32` features.
- `cited_doc_ids`: a `list` of `int32` features.

#### corpus

- `doc_id`: an `int32` feature.
- `title`: a `string` feature.
- `abstract`: a `list` of `string` features.
- `structured`: a `bool` feature.

### Data Splits

#### claims

|      |train|validation|test|
|------|----:|---------:|---:|
|claims| 1261|       450| 300|

#### corpus

|      |train|
|------|----:|
|corpus| 5183|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

https://github.com/allenai/scifact/blob/master/LICENSE.md

The SciFact dataset is released under the [CC BY-NC 2.0](https://creativecommons.org/licenses/by-nc/2.0/) license. By using the SciFact data, you are agreeing to its usage terms.

### Citation Information

```
@inproceedings{wadden-etal-2020-fact,
    title = "Fact or Fiction: Verifying Scientific Claims",
    author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.609",
    doi = "10.18653/v1/2020.emnlp-main.609",
    pages = "7534--7550",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq), [@dwadden](https://github.com/dwadden), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun) for adding this dataset.

Field descriptions for the two configurations:

#### claims

- `id`: an `int32` feature.
- `claim`: a `string` feature; the text of the scientific claim.
- `evidence_doc_id`: a `string` feature; the ID of the evidence document.
- `evidence_label`: a `string` feature; the label assigned to the evidence.
- `evidence_sentences`: a `list` of `int32` features; the indices of the sentences that contain the evidence.
- `cited_doc_ids`: a `list` of `int32` features; the IDs of the cited documents.

#### corpus

- `doc_id`: an `int32` feature; the document ID.
- `title`: a `string` feature; the document title.
- `abstract`: a `list` of `string` features; the abstract split into sentences.
- `structured`: a `bool` feature; whether the abstract is a structured abstract.
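Because `evidence_doc_id` is a string and `doc_id` is an `int32`, joining a claim row to its corpus document needs a cast. The sketch below shows one way to look up the rationale sentences for the 'validation' example from the card. The corpus record is hypothetical (the real abstract of document 14717500 is not shown in the card), and two behaviours are assumptions rather than documented facts: that sentence indices are 0-based, and that rows without evidence carry an empty `evidence_doc_id`.

```python
from typing import Dict, List

# Claim record copied from the 'validation' example in the card.
claim_row = {
    "id": 3,
    "claim": "1,000 genomes project enables mapping of genetic sequence variation "
             "consisting of rare variants with larger penetrance effects than common variants.",
    "evidence_doc_id": "14717500",
    "evidence_label": "SUPPORT",
    "evidence_sentences": [2, 5],
    "cited_doc_ids": [14717500],
}

# HYPOTHETICAL corpus record: placeholder sentences stand in for the real abstract.
corpus_row = {
    "doc_id": 14717500,
    "title": "placeholder title",
    "abstract": [f"placeholder sentence {i}" for i in range(8)],
    "structured": False,
}


def rationale_sentences(claim: dict, corpus_by_id: Dict[int, dict]) -> List[str]:
    """Return the abstract sentences that `evidence_sentences` points at."""
    doc_id_text = claim["evidence_doc_id"]
    if not doc_id_text:  # assumption: rows without evidence carry an empty string
        return []
    # `evidence_doc_id` is a string while `doc_id` is an int, so cast before lookup.
    doc = corpus_by_id[int(doc_id_text)]
    # Assumption: indices are 0-based positions in the `abstract` list.
    return [doc["abstract"][i] for i in claim["evidence_sentences"]]


corpus_by_id = {corpus_row["doc_id"]: corpus_row}
selected = rationale_sentences(claim_row, corpus_by_id)
print(selected)
```

With real data, `corpus_by_id` would be built once from the `corpus` configuration and reused for every claim row. How the two configurations are loaded (for example by configuration name through the Hugging Face `datasets` library) depends on the installed library version and is left out of the sketch.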

Provider:

allenai

Summary of original information

Dataset overview

Dataset name

Name: SciFact

Language

Language: English (en)

License

License: CC BY-NC 2.0

Multilinguality

Multilinguality: monolingual

Size category

Size category: 1K<n<10K

Source datasets

Source datasets: original

Task categories

Task categories: text classification

Task IDs

Task ID: fact-checking

Papers with Code ID

Papers with Code ID: scifact

Dataset structure

Configuration names

Configuration names: corpus and claims

Data features

corpus

doc_id: int32
title: string
abstract: sequence of string
structured: bool

claims

id: int32
claim: string
evidence_doc_id: string
evidence_label: string
evidence_sentences: sequence of int32
cited_doc_ids: sequence of int32

Data splits

corpus

train: 5183 examples, 7993572 bytes

claims

train: 1261 examples, 168627 bytes
validation: 450 examples, 60360 bytes
test: 300 examples, 33625 bytes

Download and dataset size

Download size: 3115079 bytes
Dataset size: corpus 7993572 bytes, claims 262612 bytes
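As a quick consistency check on the figures listed above, the per-split byte counts of the claims configuration should add up to its reported dataset size. The sketch below does that arithmetic and also totals the rows. The total row count is larger than the "1.4K claims" quoted in the card; one plausible reading, stated here as an assumption and not confirmed by the card, is that a claim with several evidence documents occupies several rows.

```python
# Figures copied from the split listing for the claims configuration.
claims_splits = {
    "train": {"examples": 1261, "bytes": 168627},
    "validation": {"examples": 450, "bytes": 60360},
    "test": {"examples": 300, "bytes": 33625},
}

total_examples = sum(s["examples"] for s in claims_splits.values())
total_bytes = sum(s["bytes"] for s in claims_splits.values())

print(total_examples)  # 2011 rows across the three splits
print(total_bytes)     # 262612, matching the reported claims dataset size

# Share of rows held by each split.
for name, s in claims_splits.items():
    share = s["examples"] / total_examples
    print(f"{name}: {share:.1%} of rows")
```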

Dataset creation

License information

License: CC BY-NC 2.0

Citation information

@inproceedings{wadden-etal-2020-fact,
    title = "Fact or Fiction: Verifying Scientific Claims",
    author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.609",
    doi = "10.18653/v1/2020.emnlp-main.609",
    pages = "7534--7550",
}

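Collected summary

Dataset introduction

Construction

SciFact pairs 1.4K expert-written scientific claims with abstracts that contain evidence. The abstracts come from the scientific literature, and experts annotated them to indicate whether each one supports a given claim. Besides the label, the annotators marked the specific sentences that serve as evidence for the claim.

Characteristics

SciFact focuses on scientific fact verification and provides claim and evidence pairs. Each claim comes with an evidence abstract, a label indicating whether the abstract supports the claim, and the specific evidence sentences. This structure makes SciFact a useful resource for scientific fact verification.

Usage

SciFact can be used to train and evaluate scientific fact verification models. After loading the dataset, users can access the claims and their evidence abstracts for training and testing. The claims configuration provides train, validation, and test splits for model selection and performance evaluation; the corpus configuration has a single train split. Because the data is structured, individual fields are easy to extract for further analysis.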
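One common preparation step before training is to collect all evidence for a claim in one place. The sketch below groups rows by claim `id`, assuming that a claim may appear in more than one row (once per evidence document) and that rows without evidence have an empty `evidence_doc_id`. The rows are hypothetical and invented for illustration; the label value `CONTRADICT` is also an assumption, since the card's example only shows `SUPPORT`.

```python
from collections import defaultdict

# HYPOTHETICAL rows shaped like the `claims` configuration; all values are invented.
rows = [
    {"id": 1, "claim": "claim A", "evidence_doc_id": "10",
     "evidence_label": "SUPPORT", "evidence_sentences": [0, 3]},
    {"id": 1, "claim": "claim A", "evidence_doc_id": "11",
     "evidence_label": "CONTRADICT", "evidence_sentences": [2]},
    {"id": 2, "claim": "claim B", "evidence_doc_id": "",
     "evidence_label": "", "evidence_sentences": []},
]


def group_evidence(rows):
    """Map claim id -> claim text plus a dict of evidence keyed by document id."""
    grouped = defaultdict(lambda: {"claim": None, "evidence": {}})
    for row in rows:
        entry = grouped[row["id"]]
        entry["claim"] = row["claim"]
        if row["evidence_doc_id"]:  # skip rows that carry no evidence
            entry["evidence"][int(row["evidence_doc_id"])] = {
                "label": row["evidence_label"],
                "sentences": list(row["evidence_sentences"]),
            }
    return dict(grouped)


grouped = group_evidence(rows)
print(len(grouped))                    # number of distinct claims
print(sorted(grouped[1]["evidence"]))  # evidence document ids for claim 1
```

Grouping first makes it straightforward to count distinct claims per split or to build one training instance per claim and candidate abstract.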

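Background and challenges

Background

SciFact was released by the Allen Institute for AI in 2020 to address fact-checking of claims in the scientific literature. It contains 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales. The central research question is how to verify scientific claims with natural language processing, which in turn supports assessing the reliability of scientific literature. The dataset serves as a benchmark for scientific fact-checking and has supported the development of related algorithms and models.

Current challenges

SciFact poses several challenges. First, scientific claims often involve specialized terminology and multi-step reasoning, so understanding and verifying them accurately is hard. Second, the accuracy and consistency of expert annotation are critical during construction, yet the diversity and complexity of scientific fields make consistently high annotation quality difficult to guarantee. In addition, the dataset is relatively small, which may limit how well models generalize to broader settings.

Common scenarios

Typical use case

The typical use of SciFact is fact-checking scientific claims. By pairing expert-written claims with evidence-containing abstracts and providing labels and rationales, the dataset gives natural language processing models training and test data for text classification and fact-checking tasks.

Practical applications

In practice, SciFact is used to build automated scientific fact-checking systems that help researchers, journal editors, and science communicators check scientific claims quickly. It can also support literature retrieval tools that help users locate relevant evidence in large collections of scientific papers.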
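To make the evidence-retrieval idea concrete, the sketch below ranks corpus documents for a claim by Jaccard overlap between their token sets. This is a deliberately simple stand-in chosen for illustration, not the retrieval method of the SciFact authors or of any particular system. The function names and the two corpus records are hypothetical; only the field names (`doc_id`, `title`, `abstract`) follow the corpus configuration.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_abstracts(claim: str, corpus: List[dict], top_k: int = 3) -> List[Tuple[int, float]]:
    """Return up to `top_k` (doc_id, score) pairs, highest Jaccard overlap first."""
    claim_tokens = tokenize(claim)
    scored = []
    for doc in corpus:
        doc_tokens = tokenize(doc["title"] + " " + " ".join(doc["abstract"]))
        union = claim_tokens | doc_tokens
        score = len(claim_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((doc["doc_id"], score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# HYPOTHETICAL corpus records; the text is invented for illustration.
corpus = [
    {"doc_id": 1, "title": "Rare variants in human genomes",
     "abstract": ["Rare variants can have large effects."]},
    {"doc_id": 2, "title": "White matter development",
     "abstract": ["Diffusion imaging of newborn brains."]},
]

ranked = rank_abstracts("Rare variants have larger effects than common variants.", corpus, top_k=1)
print(ranked[0][0])  # id of the best-matching document
```

A real system would normally replace the overlap score with a stronger ranker, but the interface (claim in, ranked `doc_id` list out) stays the same.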

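Derived and related work

Research built on SciFact includes deep-learning models for scientific claim verification, scientific literature retrieval systems, and explorations of multimodal scientific fact-checking. This work has advanced natural language processing and provides technical support for the trustworthy dissemination of scientific information.

The content above was collected and summarized by 遇见数据集.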
