allenai/scifact
Source: Hugging Face. Updated 2023-12-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/allenai/scifact
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- cc-by-nc-2.0
multilinguality:
- monolingual
pretty_name: SciFact
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- fact-checking
paperswithcode_id: scifact
dataset_info:
- config_name: corpus
  features:
  - name: doc_id
    dtype: int32
  - name: title
    dtype: string
  - name: abstract
    sequence: string
  - name: structured
    dtype: bool
  splits:
  - name: train
    num_bytes: 7993572
    num_examples: 5183
  download_size: 3115079
  dataset_size: 7993572
- config_name: claims
  features:
  - name: id
    dtype: int32
  - name: claim
    dtype: string
  - name: evidence_doc_id
    dtype: string
  - name: evidence_label
    dtype: string
  - name: evidence_sentences
    sequence: int32
  - name: cited_doc_ids
    sequence: int32
  splits:
  - name: train
    num_bytes: 168627
    num_examples: 1261
  - name: test
    num_bytes: 33625
    num_examples: 300
  - name: validation
    num_bytes: 60360
    num_examples: 450
  download_size: 3115079
  dataset_size: 262612
---
# Dataset Card for "scifact"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://scifact.apps.allenai.org/](https://scifact.apps.allenai.org/)
- **Repository:** https://github.com/allenai/scifact
- **Paper:** [Fact or Fiction: Verifying Scientific Claims](https://aclanthology.org/2020.emnlp-main.609/)
- **Point of Contact:** [David Wadden](mailto:davidw@allenai.org)
- **Size of downloaded dataset files:** 6.23 MB
- **Size of the generated dataset:** 8.26 MB
- **Total amount of disk used:** 14.49 MB
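The three sizes above can be reproduced from the byte counts in the card's metadata. The sketch below is only a consistency check under two assumptions that the card does not state: that the download size is counted once per configuration, and that "MB" means 10^6 bytes.

```python
# Byte counts copied from the dataset_info metadata of this card.
DOWNLOAD_BYTES = 3_115_079   # reported for each configuration
CORPUS_BYTES = 7_993_572     # generated size of the corpus configuration
CLAIMS_SPLIT_BYTES = {"train": 168_627, "test": 33_625, "validation": 60_360}


def to_mb(num_bytes: int) -> float:
    """Convert bytes to megabytes (assumed here to be 10**6 bytes)."""
    return round(num_bytes / 1_000_000, 2)


claims_bytes = sum(CLAIMS_SPLIT_BYTES.values())   # generated size of the claims configuration
downloaded = 2 * DOWNLOAD_BYTES                   # assumption: counted once per configuration
generated = CORPUS_BYTES + claims_bytes
total = downloaded + generated

print(to_mb(downloaded), to_mb(generated), to_mb(total))  # prints: 6.23 8.26 14.49
```

Under these assumptions the script reproduces the 6.23 MB, 8.26 MB, and 14.49 MB figures listed above.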
### Dataset Summary
SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
The text in the dataset is in English (`en`), as declared in the card metadata.
## Dataset Structure
### Data Instances
#### claims
- **Size of downloaded dataset files:** 3.12 MB
- **Size of the generated dataset:** 262.61 kB
- **Total amount of disk used:** 3.38 MB
An example of 'validation' looks as follows.
```
{
"cited_doc_ids": [14717500],
"claim": "1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.",
"evidence_doc_id": "14717500",
"evidence_label": "SUPPORT",
"evidence_sentences": [2, 5],
"id": 3
}
```
#### corpus
- **Size of downloaded dataset files:** 3.12 MB
- **Size of the generated dataset:** 7.99 MB
- **Total amount of disk used:** 11.11 MB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"abstract": "[\"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and res...",
"doc_id": 4983,
"structured": false,
"title": "Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging."
}
```
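Records like the two shown above come from the `claims` and `corpus` configurations. A minimal loading sketch follows, assuming the Hugging Face `datasets` library is installed; the `load_scifact` helper and its `loader` parameter are hypothetical names introduced here so the function can be exercised without network access.

```python
from typing import Any, Callable, Optional, Tuple

REPO_ID = "allenai/scifact"      # repository name shown at the top of this page
CONFIGS = ("claims", "corpus")   # configuration names listed in the card metadata


def load_scifact(loader: Optional[Callable[..., Any]] = None) -> Tuple[Any, Any]:
    """Load both configurations and return (claims, corpus).

    `loader` is a hypothetical hook for testing; when omitted, the function
    falls back to `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    claims, corpus = (loader(REPO_ID, name) for name in CONFIGS)
    return claims, corpus
```

With the default loader, `claims["validation"][0]` would be expected to resemble the validation record shown above. Depending on the installed `datasets` version, loading may require additional arguments that are not covered here.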
### Data Fields
The data fields are the same among all splits.
#### claims
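- `id`: an `int32` feature.
- `claim`: a `string` feature; the text of the scientific claim.
- `evidence_doc_id`: a `string` feature; the ID of the evidence document.
- `evidence_label`: a `string` feature; the label assigned to the evidence (for example `SUPPORT`).
- `evidence_sentences`: a `list` of `int32` features; the indices of the evidence sentences.
- `cited_doc_ids`: a `list` of `int32` features; the IDs of the cited documents.
#### corpus
- `doc_id`: an `int32` feature; the document ID.
- `title`: a `string` feature; the document title.
- `abstract`: a `list` of `string` features; the abstract split into sentences.
- `structured`: a `bool` feature; whether the abstract is structured.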
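Because `evidence_doc_id` is a string while `doc_id` is an `int32`, joining a claims row to its corpus entry needs a type conversion. The sketch below illustrates one way to pull the rationale sentences. It assumes that `evidence_sentences` holds 0-based indices into `abstract` and that rows without evidence carry an empty `evidence_doc_id`; neither is stated in this card. The `rationale_sentences` helper and the placeholder corpus entry are hypothetical.

```python
from typing import Dict, List


def rationale_sentences(claim_row: dict, corpus_by_id: Dict[int, dict]) -> List[str]:
    """Return the abstract sentences that a claims row marks as evidence."""
    if not claim_row["evidence_doc_id"]:
        return []  # assumption: rows without evidence carry an empty string
    # evidence_doc_id is a string; doc_id is an integer, so convert before the lookup.
    doc = corpus_by_id[int(claim_row["evidence_doc_id"])]
    # Assumption: sentence indices are 0-based positions in the abstract list.
    return [doc["abstract"][i] for i in claim_row["evidence_sentences"]]


# The claims row is the validation example from this card. The corpus entry is a
# hypothetical stand-in with placeholder sentences, not real SciFact text.
claim_row = {
    "id": 3,
    "claim": "1,000 genomes project enables mapping of genetic sequence variation "
             "consisting of rare variants with larger penetrance effects than common variants.",
    "evidence_doc_id": "14717500",
    "evidence_label": "SUPPORT",
    "evidence_sentences": [2, 5],
    "cited_doc_ids": [14717500],
}
corpus_by_id = {
    14717500: {
        "doc_id": 14717500,
        "title": "placeholder title",
        "abstract": [f"placeholder sentence {i}" for i in range(7)],
        "structured": False,
    }
}

print(rationale_sentences(claim_row, corpus_by_id))
```

Building `corpus_by_id` once as a dictionary keeps each lookup constant-time, which matters when every claims row is resolved against the 5183 corpus entries.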
### Data Splits
#### claims
| |train|validation|test|
|------|----:|---------:|---:|
|claims| 1261| 450| 300|
#### corpus
| |train|
|------|----:|
|corpus| 5183|
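The row counts in the two tables can be turned into split shares with a few lines of arithmetic. This is only an illustration of the numbers above. The total is a count of rows in the `claims` configuration, which need not equal the number of distinct claims.

```python
# Row counts copied from the tables above.
CLAIM_ROWS = {"train": 1261, "validation": 450, "test": 300}
CORPUS_ROWS = 5183

total_rows = sum(CLAIM_ROWS.values())
shares = {split: n / total_rows for split, n in CLAIM_ROWS.items()}

# Print each split with its row count and its share of all claims rows.
for split, share in shares.items():
    print(f"{split:<10} {CLAIM_ROWS[split]:>5} rows  {share:6.1%}")
print(f"corpus abstracts per claims row: {CORPUS_ROWS / total_rows:.2f}")
```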
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
https://github.com/allenai/scifact/blob/master/LICENSE.md
The SciFact dataset is released under the [CC BY-NC 2.0](https://creativecommons.org/licenses/by-nc/2.0/) license. By using the SciFact data, you agree to its usage terms.
### Citation Information
```
@inproceedings{wadden-etal-2020-fact,
title = "Fact or Fiction: Verifying Scientific Claims",
author = "Wadden, David and
Lin, Shanchuan and
Lo, Kyle and
Wang, Lucy Lu and
van Zuylen, Madeleine and
Cohan, Arman and
Hajishirzi, Hannaneh",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.609",
doi = "10.18653/v1/2020.emnlp-main.609",
pages = "7534--7550",
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq), [@dwadden](https://github.com/dwadden), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun) for adding this dataset.
Provider:
allenai
Summary of original information
Dataset overview
Dataset name
- Name: SciFact
Language
- Language: English (en)
License
- License: CC BY-NC 2.0
Multilinguality
- Multilinguality: monolingual
Size category
- Size category: 1K<n<10K
Source datasets
- Source datasets: original
Task categories
- Task category: text classification
Task IDs
- Task ID: fact-checking
Papers with Code ID
- Papers with Code ID: scifact
Dataset structure
Configuration names
- Configuration names: corpus and claims
Data features
corpus
- doc_id: int32
- title: string
- abstract: sequence of string
- structured: bool
claims
- id: int32
- claim: string
- evidence_doc_id: string
- evidence_label: string
- evidence_sentences: sequence of int32
- cited_doc_ids: sequence of int32
Data splits
corpus
- train: 5183 examples, 7993572 bytes
claims
- train: 1261 examples, 168627 bytes
- validation: 450 examples, 60360 bytes
- test: 300 examples, 33625 bytes
Download and dataset size
- Download size: 3115079 bytes
- Dataset size: corpus 7993572 bytes, claims 262612 bytes
Dataset creation
License information
- License: CC BY-NC 2.0
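Collected summary
Dataset introduction
Construction
SciFact pairs 1.4K expert-written scientific claims with abstracts that contain evidence. Abstracts were drawn from the scientific literature, and experts annotated whether each abstract supports a given claim. Beyond the support or non-support label, the annotators also marked the specific sentences that serve as evidence.
Characteristics
SciFact focuses on scientific claim verification and provides claim-evidence pairs. Each claim comes with an evidence abstract, a support or non-support label, and the specific evidence sentences. This structured form makes SciFact a valuable resource for scientific fact verification.
Usage
SciFact can be used to train and evaluate scientific claim verification models. After loading the dataset, users can access the claims and their evidence abstracts for training and testing. The dataset provides train, validation, and test splits for validation and performance evaluation, and its structured format makes it easy to extract individual fields for further analysis.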
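As one example of extracting a single field for analysis, the sketch below counts `evidence_label` values over a list of claims rows. The `label_counts` helper and the sample rows are hypothetical. Only the `SUPPORT` value appears in the card's example, so the `CONTRADICT` value and the empty label are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping


def label_counts(rows: Iterable[Mapping]) -> Counter:
    """Count evidence_label values, grouping empty labels under 'NO_LABEL'."""
    return Counter(row["evidence_label"] or "NO_LABEL" for row in rows)


# Hypothetical rows for illustration; only "SUPPORT" appears in the card's example.
sample_rows = [
    {"id": 3, "evidence_label": "SUPPORT"},
    {"id": 100, "evidence_label": "CONTRADICT"},  # assumed label value
    {"id": 101, "evidence_label": ""},            # assumed: no evidence recorded
    {"id": 102, "evidence_label": "SUPPORT"},
]

print(label_counts(sample_rows))
```

The same function accepts any iterable of row dictionaries, so it could be applied to a loaded split as well as to the in-memory sample used here.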
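Background and challenges
Background overview
SciFact was released by the Allen Institute for AI in 2020 to address fact-checking against the scientific literature. It contains 1,400 expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales. Its core research question is how to verify scientific claims with natural language processing, in support of assessing the credibility of scientific literature. Its release gave scientific fact-checking a benchmark dataset and encouraged the development of related algorithms and models.
Current challenges
SciFact faces two main challenges. First, scientific claims often involve specialized terminology and logical reasoning, which makes them hard to interpret and verify accurately. Second, the accuracy and consistency of expert annotation are critical during construction, yet the diversity and complexity of scientific fields make high annotation quality hard to guarantee. In addition, the dataset is relatively small, which may limit how well models generalize to broader settings.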
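Common scenarios
Classic use cases
SciFact's classic use case is fact-checking scientific claims. By pairing expert-written claims with evidence-containing abstracts, together with annotated labels and rationales, it provides training and test data for natural language processing models, particularly for text classification and fact-checking tasks.
Practical applications
In practice, SciFact is used to build automated scientific fact-checking systems that help researchers, journal editors, and science communicators verify scientific claims quickly. It can also support literature retrieval tools that help users locate relevant evidence in large collections of scientific papers.
Derived related work
Work building on SciFact includes deep-learning models for scientific claim verification, scientific literature retrieval systems, and explorations of multimodal scientific fact-checking. These efforts have advanced natural language processing and provide technical support for the trustworthy communication of scientific information.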
The content above was collected and summarized by 遇见数据集.