allenai/scitail

Name: allenai/scitail
Creator: allenai
Published: 2024-01-04 16:25:10
License: No description available

Source: Hugging Face. Updated: 2024-01-04. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/allenai/scitail

Resource description:

---
language:
- en
paperswithcode_id: scitail
pretty_name: SciTail
dataset_info:
- config_name: dgem_format
  features:
  - name: premise
    dtype: string
  - name: hypothesis
    dtype: string
  - name: label
    dtype: string
  - name: hypothesis_graph_structure
    dtype: string
  splits:
  - name: train
    num_bytes: 6817626
    num_examples: 23088
  - name: test
    num_bytes: 606867
    num_examples: 2126
  - name: validation
    num_bytes: 393209
    num_examples: 1304
  download_size: 2007018
  dataset_size: 7817702
- config_name: predictor_format
  features:
  - name: answer
    dtype: string
  - name: sentence2_structure
    dtype: string
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: gold_label
    dtype: string
  - name: question
    dtype: string
  splits:
  - name: train
    num_bytes: 8864108
    num_examples: 23587
  - name: test
    num_bytes: 795275
    num_examples: 2126
  - name: validation
    num_bytes: 510140
    num_examples: 1304
  download_size: 2169238
  dataset_size: 10169523
- config_name: snli_format
  features:
  - name: sentence1_binary_parse
    dtype: string
  - name: sentence1_parse
    dtype: string
  - name: sentence1
    dtype: string
  - name: sentence2_parse
    dtype: string
  - name: sentence2
    dtype: string
  - name: annotator_labels
    sequence: string
  - name: gold_label
    dtype: string
  splits:
  - name: train
    num_bytes: 22457379
    num_examples: 23596
  - name: test
    num_bytes: 2005142
    num_examples: 2126
  - name: validation
    num_bytes: 1264378
    num_examples: 1304
  download_size: 7476483
  dataset_size: 25726899
- config_name: tsv_format
  features:
  - name: premise
    dtype: string
  - name: hypothesis
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_bytes: 4606527
    num_examples: 23097
  - name: test
    num_bytes: 410267
    num_examples: 2126
  - name: validation
    num_bytes: 260422
    num_examples: 1304
  download_size: 1836546
  dataset_size: 5277216
configs:
- config_name: dgem_format
  data_files:
  - split: train
    path: dgem_format/train-*
  - split: test
    path: dgem_format/test-*
  - split: validation
    path: dgem_format/validation-*
- config_name: predictor_format
  data_files:
  - split: train
    path: predictor_format/train-*
  - split: test
    path: predictor_format/test-*
  - split: validation
    path: predictor_format/validation-*
- config_name: snli_format
  data_files:
  - split: train
    path: snli_format/train-*
  - split: test
    path: snli_format/test-*
  - split: validation
    path: snli_format/validation-*
- config_name: tsv_format
  data_files:
  - split: train
    path: tsv_format/train-*
  - split: test
    path: tsv_format/test-*
  - split: validation
    path: tsv_format/validation-*
---

# Dataset Card for "scitail"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://allenai.org/data/scitail](https://allenai.org/data/scitail)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 56.70 MB
- **Size of the generated dataset:** 49.09 MB
- **Total amount of disk used:** 105.79 MB

### Dataset Summary

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pairs as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples: 10,101 with the entails label and 16,925 with the neutral label.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### dgem_format

- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 7.83 MB
- **Total amount of disk used:** 22.01 MB

An example of 'train' looks as follows.

```
```

#### predictor_format

- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 10.19 MB
- **Total amount of disk used:** 24.37 MB

An example of 'validation' looks as follows.

```
```

#### snli_format

- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 25.77 MB
- **Total amount of disk used:** 39.95 MB

An example of 'validation' looks as follows.

```
```

#### tsv_format

- **Size of downloaded dataset files:** 14.18 MB
- **Size of the generated dataset:** 5.30 MB
- **Total amount of disk used:** 19.46 MB

An example of 'validation' looks as follows.

```
```

### Data Fields

The data fields are the same among all splits.

#### dgem_format

- `premise`: a `string` feature.
- `hypothesis`: a `string` feature.
- `label`: a `string` feature.
- `hypothesis_graph_structure`: a `string` feature.

#### predictor_format

- `answer`: a `string` feature.
- `sentence2_structure`: a `string` feature.
- `sentence1`: a `string` feature.
- `sentence2`: a `string` feature.
- `gold_label`: a `string` feature.
- `question`: a `string` feature.

#### snli_format

- `sentence1_binary_parse`: a `string` feature.
- `sentence1_parse`: a `string` feature.
- `sentence1`: a `string` feature.
- `sentence2_parse`: a `string` feature.
- `sentence2`: a `string` feature.
- `annotator_labels`: a `list` of `string` features.
- `gold_label`: a `string` feature.

#### tsv_format

- `premise`: a `string` feature.
- `hypothesis`: a `string` feature.
- `label`: a `string` feature.

### Data Splits

| name             | train | validation | test |
|------------------|------:|-----------:|-----:|
| dgem_format      | 23088 |       1304 | 2126 |
| predictor_format | 23587 |       1304 | 2126 |
| snli_format      | 23596 |       1304 | 2126 |
| tsv_format       | 23097 |       1304 | 2126 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@inproceedings{scitail,
  Author = {Tushar Khot and Ashish Sabharwal and Peter Clark},
  Booktitle = {AAAI},
  Title = {{SciTail}: A Textual Entailment Dataset from Science Question Answering},
  Year = {2018}
}
```

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun), [@thomwolf](https://github.com/thomwolf) for adding this dataset.

Provided by:

allenai

Summary of original information

Dataset overview

Dataset name: SciTail

Dataset ID: scitail

Language: English

Dataset structure

Configurations

dgem_format
- Features:
  - premise: string
  - hypothesis: string
  - label: string
  - hypothesis_graph_structure: string
- Splits:
  - Train: 23088 examples, 6817626 bytes
  - Test: 2126 examples, 606867 bytes
  - Validation: 1304 examples, 393209 bytes
- Download size: 2007018 bytes
- Dataset size: 7817702 bytes
predictor_format
- Features:
  - answer: string
  - sentence2_structure: string
  - sentence1: string
  - sentence2: string
  - gold_label: string
  - question: string
- Splits:
  - Train: 23587 examples, 8864108 bytes
  - Test: 2126 examples, 795275 bytes
  - Validation: 1304 examples, 510140 bytes
- Download size: 2169238 bytes
- Dataset size: 10169523 bytes
snli_format
- Features:
  - sentence1_binary_parse: string
  - sentence1_parse: string
  - sentence1: string
  - sentence2_parse: string
  - sentence2: string
  - annotator_labels: sequence of strings
  - gold_label: string
- Splits:
  - Train: 23596 examples, 22457379 bytes
  - Test: 2126 examples, 2005142 bytes
  - Validation: 1304 examples, 1264378 bytes
- Download size: 7476483 bytes
- Dataset size: 25726899 bytes
tsv_format
- Features:
  - premise: string
  - hypothesis: string
  - label: string
- Splits:
  - Train: 23097 examples, 4606527 bytes
  - Test: 2126 examples, 410267 bytes
  - Validation: 1304 examples, 260422 bytes
- Download size: 1836546 bytes
- Dataset size: 5277216 bytes
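
As a quick consistency check on the figures in this listing, the sketch below sums the per-split example counts and byte sizes for each configuration and compares the byte totals with the reported dataset sizes. All numbers are copied from the listing; the names `SPLITS`, `REPORTED_DATASET_SIZE`, and `totals` are illustrative only.

```python
# Split statistics copied from the configuration listing:
# config -> split -> (num_examples, num_bytes)
SPLITS = {
    "dgem_format": {
        "train": (23088, 6817626), "test": (2126, 606867), "validation": (1304, 393209),
    },
    "predictor_format": {
        "train": (23587, 8864108), "test": (2126, 795275), "validation": (1304, 510140),
    },
    "snli_format": {
        "train": (23596, 22457379), "test": (2126, 2005142), "validation": (1304, 1264378),
    },
    "tsv_format": {
        "train": (23097, 4606527), "test": (2126, 410267), "validation": (1304, 260422),
    },
}

# Dataset sizes (bytes) as reported per configuration in the listing.
REPORTED_DATASET_SIZE = {
    "dgem_format": 7817702,
    "predictor_format": 10169523,
    "snli_format": 25726899,
    "tsv_format": 5277216,
}


def totals(config):
    """Return (total examples, total bytes) summed over all splits of a config."""
    splits = SPLITS[config].values()
    return sum(n for n, _ in splits), sum(b for _, b in splits)


for config in SPLITS:
    examples, size = totals(config)
    matches = size == REPORTED_DATASET_SIZE[config]
    print(f"{config}: {examples} examples, {size} bytes, matches reported size: {matches}")
```

Only snli_format adds up to the 27,026 examples quoted in the dataset summary; the other configurations have slightly fewer training rows, and the listing does not say why.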

Data file paths

dgem_format
- Train: dgem_format/train-*
- Test: dgem_format/test-*
- Validation: dgem_format/validation-*
predictor_format
- Train: predictor_format/train-*
- Test: predictor_format/test-*
- Validation: predictor_format/validation-*
snli_format
- Train: snli_format/train-*
- Test: snli_format/test-*
- Validation: snli_format/validation-*
tsv_format
- Train: tsv_format/train-*
- Test: tsv_format/test-*
- Validation: tsv_format/validation-*
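
The path entries above all follow one `<config>/<split>-*` glob pattern. The following minimal sketch builds that pattern and checks it against a file name; the helper `data_file_pattern` and the example file name are hypothetical and serve only to show how the glob matches.

```python
from fnmatch import fnmatch

CONFIGS = ("dgem_format", "predictor_format", "snli_format", "tsv_format")
SPLITS = ("train", "test", "validation")


def data_file_pattern(config, split):
    """Build the glob pattern for one config/split, e.g. 'tsv_format/train-*'."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{config}/{split}-*"


# Hypothetical file name, not taken from the listing.
example_file = "tsv_format/train-00000-of-00001.parquet"
print(fnmatch(example_file, data_file_pattern("tsv_format", "train")))  # True
print(fnmatch(example_file, data_file_pattern("tsv_format", "test")))   # False
```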

Dataset creation

Citation:

@inproceedings{scitail,
  Author = {Tushar Khot and Ashish Sabharwal and Peter Clark},
  Booktitle = {AAAI},
  Title = {{SciTail}: A Textual Entailment Dataset from Science Question Answering},
  Year = {2018}
}

Contributors:

@patrickvonplaten
@mariamabarham
@lewtun
@thomwolf

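Collected summary

Dataset introduction

Construction

In natural language processing, textual entailment is the task of deciding whether a premise logically supports a hypothesis. SciTail was built for this task in the science domain. First, questions and their correct answers are taken from multiple-choice science exams and rewritten as assertive statements, which serve as hypotheses. Next, information retrieval is used to fetch text relevant to each hypothesis from a large corpus of web sentences; these sentences serve as premises. Finally, crowd workers label each premise-hypothesis pair as "supports" (entails) or "does not support" (neutral), yielding a dataset of 27,026 examples.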

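Characteristics

SciTail has three notable characteristics. First, it focuses on science: hypotheses are derived from science exam questions, and premises are web sentences retrieved for those hypotheses. Second, the label distribution is moderately imbalanced, with 10,101 entails examples and 16,925 neutral examples (about 37% entails). Third, the dataset is provided in several formats (dgem_format, predictor_format, snli_format, and tsv_format), each with a different combination of fields such as premise, hypothesis, label, and syntactic structure, which makes it easier to fit different research needs.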
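
To make the label distribution concrete, the sketch below computes the class proportions from the two counts quoted above, the accuracy of always predicting the majority class, and inverse-frequency class weights. The weighting formula (total / (number of classes × class count)) is a common heuristic chosen here for illustration, not something the dataset authors prescribe.

```python
# Label counts quoted in the dataset summary.
COUNTS = {"entails": 10101, "neutral": 16925}

total = sum(COUNTS.values())

# Share of each label in the full dataset.
proportions = {label: n / total for label, n in COUNTS.items()}

# Accuracy obtained by always predicting the most frequent label.
majority_label = max(COUNTS, key=COUNTS.get)
majority_accuracy = COUNTS[majority_label] / total

# Inverse-frequency weights (an assumed heuristic): rarer classes get larger weights.
class_weights = {label: total / (len(COUNTS) * n) for label, n in COUNTS.items()}

print(total)
print({label: round(p, 3) for label, p in proportions.items()})
print(majority_label, round(majority_accuracy, 3))
print({label: round(w, 3) for label, w in class_weights.items()})
```

Any classifier evaluated on the full label distribution should be compared against the majority-class accuracy rather than against 50%.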

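Usage

When using SciTail, choose the configuration that fits the task. For entailment classification, tsv_format or dgem_format can be used directly through the premise, hypothesis, and label fields. For analysis of syntactic structure, snli_format includes sentence parse trees and annotator labels. The data is split into training, validation, and test sets: the training set has between 23,088 and 23,596 examples depending on the configuration, the validation set has 1,304, and the test set has 2,126. The dataset can be loaded with the Hugging Face Datasets library, for example with load_dataset('allenai/scitail', 'tsv_format').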
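
A minimal loading sketch follows, assuming the `datasets` package is installed and the Hub is reachable. The helper names `load_scitail` and `to_pair` are illustrative. Treating `sentence1` as the premise and `sentence2` as the hypothesis in snli_format and predictor_format follows the SNLI convention and is an assumption here, since the card does not state it.

```python
def load_scitail(config="tsv_format", split="train"):
    """Load one SciTail configuration from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helpers below can be used without the package installed.
    from datasets import load_dataset

    return load_dataset("allenai/scitail", config, split=split)


# (premise column, hypothesis column, label column) per configuration.
# The sentence1/sentence2 roles are assumed from the SNLI convention.
PAIR_COLUMNS = {
    "tsv_format": ("premise", "hypothesis", "label"),
    "dgem_format": ("premise", "hypothesis", "label"),
    "snli_format": ("sentence1", "sentence2", "gold_label"),
    "predictor_format": ("sentence1", "sentence2", "gold_label"),
}


def to_pair(row, config):
    """Map a row from any configuration to a uniform premise/hypothesis/label dict."""
    premise_col, hypothesis_col, label_col = PAIR_COLUMNS[config]
    return {
        "premise": row[premise_col],
        "hypothesis": row[hypothesis_col],
        "label": row[label_col],
    }


# Usage (requires network access and the `datasets` package):
# train = load_scitail("tsv_format", "train")
# print(to_pair(train[0], "tsv_format"))
```

Normalizing the column names this way lets the same training code run on any of the four configurations.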

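Background and challenges

Background

In natural language understanding, textual entailment recognition is a key task for measuring a model's semantic reasoning ability: the goal is to decide whether one text (the premise) logically entails another (the hypothesis). SciTail was created by Tushar Khot, Ashish Sabharwal, and Peter Clark of the Allen Institute for AI and published at AAAI in 2018, with the aim of filling the gap in entailment data for the science domain. The dataset takes questions and correct answers from multiple-choice science exams, converts them into assertive hypotheses, retrieves relevant premises from a large body of web text, and obtains entails or neutral labels through crowdsourcing. SciTail contains 27,026 examples, of which 10,101 are labeled entails and 16,925 neutral. Its distinguishing feature is the focus on reasoning over scientific knowledge, which gives it a role as a benchmark for semantic understanding in a specialized domain and makes it relevant to research on question answering and educational technology.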

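Current challenges

The problem SciTail addresses is that general-purpose entailment datasets such as SNLI are largely built from everyday scenarios and do not cover the rigorous logic and specialized language of science, so models perform poorly on reasoning that involves scientific concepts, causal relations, and experimental conclusions. Challenges in constructing the dataset include the following: converting multiple-choice questions into assertive hypotheses must preserve meaning and avoid ambiguity; retrieving premises from a large volume of web text must balance recall against noise so that irrelevant sentences are kept out; and crowd annotation of science content requires annotators with basic science knowledge, while differences in how annotators read a pair can affect label quality. In addition, the entails and neutral classes are imbalanced (about 37% entails), which complicates learning robust representations, and lexical overlap between premise and hypothesis may encourage models to rely on surface matching rather than deeper reasoning.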
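
The surface-matching concern can be probed with a word-overlap baseline: if a classifier that only counts shared tokens scores well, lexical overlap alone carries much of the signal. The sketch below is one such baseline under stated assumptions; the tokenizer, the 0.6 threshold, and the example sentences are all made up for illustration and are not drawn from the dataset.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hypothesis_coverage(premise, hypothesis):
    """Fraction of distinct hypothesis tokens that also occur in the premise."""
    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & tokens(premise)) / len(hyp)


def overlap_baseline(premise, hypothesis, threshold=0.6):
    """Predict 'entails' when coverage reaches the (arbitrary) threshold."""
    if hypothesis_coverage(premise, hypothesis) >= threshold:
        return "entails"
    return "neutral"


# Made-up example pair, for illustration only.
premise = "Plants use sunlight to make food through photosynthesis."
hypothesis = "Plants make food using sunlight."
print(hypothesis_coverage(premise, hypothesis))  # 0.8
print(overlap_baseline(premise, hypothesis))     # entails
```

Comparing a trained model against this kind of baseline on the validation split gives a rough sense of how much it has learned beyond token matching.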

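Common scenarios

Typical use cases

In natural language inference, SciTail stands out for its science question answering background and is a standard benchmark for evaluating reasoning over science text. The dataset turns multiple-choice science exam questions and their correct answers into hypotheses, then uses information retrieval to extract related sentences from a large web corpus as premises, producing premise-hypothesis pairs. Researchers commonly use SciTail to train and test textual entailment models, in particular for distinguishing the supports (entails) and neutral relations in the science domain, and thereby to measure a model's semantic understanding and logical reasoning in science contexts.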

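Practical applications

In practice, SciTail can support the development of intelligent education systems and science knowledge engines. For example, in an automated tutoring platform, a model trained on SciTail could judge whether a student's answer is entailed by a statement in a science textbook and give feedback on correctness. The dataset can also support science literature summarization, knowledge graph construction, and retrieval systems by helping a system select, from large volumes of web text, supporting evidence that is relevant to a science question.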

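Derived and related work

SciTail has been used in a range of research at the intersection of entailment and science question answering. For example, the DGEM model, which is trained on SciTail, augments inference with a graph-structured representation of the hypothesis. Other work trains jointly on SciTail and general-purpose datasets such as SNLI to study cross-domain transfer. SciTail has also been used to evaluate pretrained language models such as BERT and RoBERTa on science reasoning, serving as a test of domain adaptation and informing work on explainable reasoning and few-shot learning.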
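
Joint training with SNLI requires reconciling the two label sets, because SNLI has three classes while SciTail has two. The sketch below shows one possible mapping, in which SNLI's contradiction class is folded into neutral. This mapping is an assumption made for illustration, not a procedure described in the source, and `harmonize` is a hypothetical helper.

```python
# Assumed mapping from SNLI's three labels to SciTail's two labels.
SNLI_TO_SCITAIL = {
    "entailment": "entails",
    "neutral": "neutral",
    "contradiction": "neutral",  # assumption: "not entailed" is treated as neutral
}

SCITAIL_LABELS = ("entails", "neutral")


def harmonize(examples, source):
    """Return (premise, hypothesis, label) triples that use SciTail-style labels.

    `examples` is an iterable of (premise, hypothesis, label) triples and
    `source` is either "snli" or "scitail".
    """
    result = []
    for premise, hypothesis, label in examples:
        if source == "snli":
            if label not in SNLI_TO_SCITAIL:
                continue  # drop rows whose label is outside the mapping
            label = SNLI_TO_SCITAIL[label]
        elif source == "scitail":
            if label not in SCITAIL_LABELS:
                raise ValueError(f"unexpected SciTail label: {label!r}")
        else:
            raise ValueError(f"unknown source: {source!r}")
        result.append((premise, hypothesis, label))
    return result


# Made-up rows, for illustration only.
snli_rows = [("a", "b", "contradiction"), ("c", "d", "entailment"), ("e", "f", "-")]
print(harmonize(snli_rows, "snli"))
```

Whether folding contradiction into neutral helps or hurts transfer is an empirical question that this sketch does not settle.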

The content above was collected and summarized by 遇见数据集.
