scitail
ModelScope community. Updated 2025-11-27; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/allenai/scitail
# Dataset Card for "scitail"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://allenai.org/data/scitail](https://allenai.org/data/scitail)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 56.70 MB
- **Size of the generated dataset:** 49.09 MB
- **Total amount of disk used:** 105.79 MB
### Dataset Summary
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question
and its correct answer choice are converted into an assertive statement to form the hypothesis. We use information
retrieval to obtain relevant text from a large corpus of web sentences and use these sentences as the premise P. We
crowdsource the annotation of each premise-hypothesis pair as supports (entails) or not (neutral) to create
the SciTail dataset. The dataset contains 27,026 examples: 10,101 labeled entails and 16,925 labeled neutral.
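The label counts in the summary imply a fairly imbalanced two-class problem. The following is a minimal sketch that works only from the counts quoted above; the names `LABEL_COUNTS` and `label_shares` are illustrative, and the shares are computed over the whole dataset rather than per split.

```python
# Counts as stated in the Dataset Summary.
TOTAL = 27_026
LABEL_COUNTS = {"entails": 10_101, "neutral": 16_925}


def label_shares(counts):
    """Return each label's share of all examples as a fraction in [0, 1]."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = label_shares(LABEL_COUNTS)
# The most frequent label sets the accuracy of an always-predict-majority baseline.
majority_label = max(LABEL_COUNTS, key=LABEL_COUNTS.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {shares[majority_label]:.1%}")
```

A classifier that always predicts the majority label would score roughly that last figure on the full dataset, which is a useful floor when reading reported accuracies.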
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### dgem_format
- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 7.83 MB
- **Total amount of disk used:** 22.01 MB
An example of 'train' looks as follows.
```
```
#### predictor_format
- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 10.19 MB
- **Total amount of disk used:** 24.37 MB
An example of 'validation' looks as follows.
```
```
#### snli_format
- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 25.77 MB
- **Total amount of disk used:** 39.95 MB
An example of 'validation' looks as follows.
```
```
#### tsv_format
- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 5.30 MB
- **Total amount of disk used:** 19.46 MB
An example of 'validation' looks as follows.
```
```
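The four configurations above can be selected by name when loading the data. The sketch below assumes the Hugging Face `datasets` package is installed and that the dataset is published under the repository id `allenai/scitail` (an assumption based on the download URL path); the helper `load_scitail` is a hypothetical wrapper, not part of any library.

```python
# The four configuration names listed in this card.
SCITAIL_CONFIGS = ("dgem_format", "predictor_format", "snli_format", "tsv_format")


def load_scitail(config="tsv_format", split="train"):
    """Load one SciTail configuration and split.

    Assumptions: the `datasets` package is installed and the dataset is
    available under the repository id "allenai/scitail".
    """
    if config not in SCITAIL_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {SCITAIL_CONFIGS}")
    # Imported lazily so the config check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("allenai/scitail", config, split=split)
```

For example, `load_scitail("snli_format", "validation")` would request the validation split of the SNLI-style configuration; the call needs network access or a local cache.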
### Data Fields
The data fields are the same across all splits.
#### dgem_format
- `premise`: a `string` feature.
- `hypothesis`: a `string` feature.
- `label`: a `string` feature.
- `hypothesis_graph_structure`: a `string` feature.
#### predictor_format
- `answer`: a `string` feature.
- `sentence2_structure`: a `string` feature.
- `sentence1`: a `string` feature.
- `sentence2`: a `string` feature.
- `gold_label`: a `string` feature.
- `question`: a `string` feature.
#### snli_format
- `sentence1_binary_parse`: a `string` feature.
- `sentence1_parse`: a `string` feature.
- `sentence1`: a `string` feature.
- `sentence2_parse`: a `string` feature.
- `sentence2`: a `string` feature.
- `annotator_labels`: a `list` of `string` features.
- `gold_label`: a `string` feature.
#### tsv_format
- `premise`: a `string` feature.
- `hypothesis`: a `string` feature.
- `label`: a `string` feature.
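Because the configurations name their fields differently, it can help to normalize records into one shape before training. The sketch below is illustrative only: `EntailmentPair`, `from_tsv_line`, and `from_snli_record` are hypothetical names, the tab-separated column order (premise, hypothesis, label) is an assumption, and so is the mapping of `sentence1` to the premise and `sentence2` to the hypothesis. The sample strings are placeholders, not rows from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntailmentPair:
    """One premise-hypothesis pair with its label ("entails" or "neutral")."""
    premise: str
    hypothesis: str
    label: str


def from_tsv_line(line):
    """Parse one tab-separated line (assumed order: premise, hypothesis, label)."""
    premise, hypothesis, label = line.rstrip("\n").split("\t")
    return EntailmentPair(premise, hypothesis, label)


def from_snli_record(record):
    """Map an snli_format record onto the same three fields.

    Assumption: sentence1 holds the premise and sentence2 the hypothesis.
    """
    return EntailmentPair(record["sentence1"], record["sentence2"], record["gold_label"])


# Placeholder values, not actual dataset content.
pair = from_tsv_line("<premise text>\t<hypothesis text>\tneutral\n")
same = from_snli_record(
    {"sentence1": "<premise text>", "sentence2": "<hypothesis text>", "gold_label": "neutral"}
)
print(pair == same)
```

Using a frozen dataclass keeps the normalized records hashable, so they can be deduplicated with a `set` if the same pair appears in more than one configuration.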
### Data Splits
| name |train|validation|test|
|----------------|----:|---------:|---:|
|dgem_format |23088| 1304|2126|
|predictor_format|23587| 1304|2126|
|snli_format |23596| 1304|2126|
|tsv_format |23097| 1304|2126|
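The train sizes in the table differ between configurations, so the per-configuration totals differ too. The sketch below sums the table's own numbers and compares them with the 27,026 examples stated in the Dataset Summary; the names `SPLITS` and `config_totals` are illustrative. By these numbers only `snli_format` adds up to 27,026, and the card does not explain the difference.

```python
# Total stated in the Dataset Summary.
SUMMARY_TOTAL = 27_026

# Split sizes copied from the Data Splits table.
SPLITS = {
    "dgem_format": {"train": 23_088, "validation": 1_304, "test": 2_126},
    "predictor_format": {"train": 23_587, "validation": 1_304, "test": 2_126},
    "snli_format": {"train": 23_596, "validation": 1_304, "test": 2_126},
    "tsv_format": {"train": 23_097, "validation": 1_304, "test": 2_126},
}


def config_totals(splits):
    """Sum train, validation, and test sizes for each configuration."""
    return {name: sum(sizes.values()) for name, sizes in splits.items()}


totals = config_totals(SPLITS)
for name, total in totals.items():
    print(f"{name}: {total} (difference from summary: {total - SUMMARY_TOTAL})")
```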
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@inproceedings{scitail,
Author = {Tushar Khot and Ashish Sabharwal and Peter Clark},
Booktitle = {AAAI},
Title = {{SciTail}: A Textual Entailment Dataset from Science Question Answering},
Year = {2018}
}
```
### Contributions
Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Provider: maas
Created: 2025-05-27



