facebook/anli

Name: facebook/anli
Creator: facebook
Published: 2023-12-21 15:34:02
License: no description provided

Hugging Face, updated 2023-12-21, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/facebook/anli

Resource description:

---
annotations_creators:
- crowdsourced
- machine-generated
language_creators:
- found
language:
- en
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
- extended|hotpot_qa
task_categories:
- text-classification
task_ids:
- natural-language-inference
- multi-input-text-classification
paperswithcode_id: anli
pretty_name: Adversarial NLI
dataset_info:
  config_name: plain_text
  features:
  - name: uid
    dtype: string
  - name: premise
    dtype: string
  - name: hypothesis
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  - name: reason
    dtype: string
  splits:
  - name: train_r1
    num_bytes: 8006888
    num_examples: 16946
  - name: dev_r1
    num_bytes: 573428
    num_examples: 1000
  - name: test_r1
    num_bytes: 574917
    num_examples: 1000
  - name: train_r2
    num_bytes: 20801581
    num_examples: 45460
  - name: dev_r2
    num_bytes: 556066
    num_examples: 1000
  - name: test_r2
    num_bytes: 572639
    num_examples: 1000
  - name: train_r3
    num_bytes: 44720719
    num_examples: 100459
  - name: dev_r3
    num_bytes: 663148
    num_examples: 1200
  - name: test_r3
    num_bytes: 657586
    num_examples: 1200
  download_size: 26286748
  dataset_size: 77126972
configs:
- config_name: plain_text
  data_files:
  - split: train_r1
    path: plain_text/train_r1-*
  - split: dev_r1
    path: plain_text/dev_r1-*
  - split: test_r1
    path: plain_text/test_r1-*
  - split: train_r2
    path: plain_text/train_r2-*
  - split: dev_r2
    path: plain_text/dev_r2-*
  - split: test_r2
    path: plain_text/test_r2-*
  - split: train_r3
    path: plain_text/train_r3-*
  - split: dev_r3
    path: plain_text/dev_r3-*
  - split: test_r3
    path: plain_text/test_r3-*
  default: true
---

# Dataset Card for "anli"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** [https://github.com/facebookresearch/anli/](https://github.com/facebookresearch/anli/)
- **Paper:** [Adversarial NLI: A New Benchmark for Natural Language Understanding](https://arxiv.org/abs/1910.14599)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 18.62 MB
- **Size of the generated dataset:** 77.12 MB
- **Total amount of disk used:** 95.75 MB

### Dataset Summary

The Adversarial Natural Language Inference (ANLI) is a new large-scale NLI benchmark dataset. The dataset is collected via an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is much more difficult than its predecessors including SNLI and MNLI. It contains three rounds. Each round has train/dev/test splits.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

English

## Dataset Structure

### Data Instances

#### plain_text

- **Size of downloaded dataset files:** 18.62 MB
- **Size of the generated dataset:** 77.12 MB
- **Total amount of disk used:** 95.75 MB

An example of 'train_r2' looks as follows.

```
This example was too long and was cropped:

{
    "hypothesis": "Idris Sultan was born in the first month of the year preceding 1994.",
    "label": 0,
    "premise": "\"Idris Sultan (born January 1993) is a Tanzanian Actor and comedian, actor and radio host who won the Big Brother Africa-Hotshot...",
    "reason": "",
    "uid": "ed5c37ab-77c5-4dbc-ba75-8fd617b19712"
}
```

### Data Fields

The data fields are the same among all splits.

#### plain_text

- `uid`: a `string` feature.
- `premise`: a `string` feature.
- `hypothesis`: a `string` feature.
- `label`: a classification label, with possible values including `entailment` (0), `neutral` (1), `contradiction` (2).
- `reason`: a `string` feature.

### Data Splits

| name     |train_r1|dev_r1|train_r2|dev_r2|train_r3|dev_r3|test_r1|test_r2|test_r3|
|----------|-------:|-----:|-------:|-----:|-------:|-----:|------:|------:|------:|
|plain_text|   16946|  1000|   45460|  1000|  100459|  1200|   1000|   1000|   1200|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[cc-4 Attribution-NonCommercial](https://github.com/facebookresearch/anli/blob/main/LICENSE)

### Citation Information

```
@InProceedings{nie2019adversarial,
    title={Adversarial NLI: A New Benchmark for Natural Language Understanding},
    author={Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@easonnie](https://github.com/easonnie), [@lhoestq](https://github.com/lhoestq), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
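
As a rough sketch of how the nine splits named in the card might be loaded, the snippet below builds split names of the form `train_r1` and passes them to `load_dataset` from the Hugging Face `datasets` library. It assumes that library is installed and that the `facebook/anli` repository is reachable; the helper names `split_name` and `load_anli_split` are hypothetical and not part of the card.

```python
ROUNDS = (1, 2, 3)
KINDS = ("train", "dev", "test")


def split_name(kind: str, round_id: int) -> str:
    """Build a split name such as 'train_r2' from a kind and a round number."""
    if kind not in KINDS:
        raise ValueError(f"unknown split kind: {kind!r}")
    if round_id not in ROUNDS:
        raise ValueError(f"unknown round: {round_id!r}")
    return f"{kind}_r{round_id}"


def load_anli_split(kind: str, round_id: int):
    """Load one ANLI split (hypothetical helper; needs network access)."""
    # Imported lazily so the module can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("facebook/anli", "plain_text", split=split_name(kind, round_id))


if __name__ == "__main__":
    # List all nine split names without touching the network.
    print([split_name(kind, r) for r in ROUNDS for kind in KINDS])
```

Keeping the name construction separate from the download makes it possible to validate a requested split before any network call is attempted.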


Provider:

facebook

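Summary of original information

Dataset overview

Dataset description

Dataset summary

Adversarial Natural Language Inference (ANLI) is a large-scale natural language inference (NLI) benchmark dataset. It was collected through an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is more challenging than its predecessors, such as SNLI and MNLI, and contains three rounds of data, each with training, development, and test sets.

Supported tasks and leaderboards

Natural language inference
Multi-input text classification

Languages

English

Dataset structure

Data instances

An example from train_r2 (the premise is truncated in the source):

{
    "hypothesis": "Idris Sultan was born in the first month of the year preceding 1994.",
    "label": 0,
    "premise": "\"Idris Sultan (born January 1993) is a Tanzanian Actor and comedian, actor and radio host who won the Big Brother Africa-Hotshot...",
    "reason": "",
    "uid": "ed5c37ab-77c5-4dbc-ba75-8fd617b19712"
}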

Data fields

The data fields are the same across all splits:

uid: string
premise: string
hypothesis: string
label: classification label; possible values are entailment (0), neutral (1), contradiction (2)
reason: string
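
The integer-to-name mapping for `label` can be sketched as follows. The mapping itself (0 entailment, 1 neutral, 2 contradiction) comes from the field list; the helper names `label_name` and `label_id` are illustrative assumptions, and the record is an abbreviated copy of the train_r2 example.

```python
# Label ids in the order given by the field list.
LABEL_NAMES = ("entailment", "neutral", "contradiction")


def label_name(label_id: int) -> str:
    """Map an integer label (0, 1, 2) to its class name."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"label id out of range: {label_id}")
    return LABEL_NAMES[label_id]


def label_id(name: str) -> int:
    """Map a class name back to its integer label."""
    return LABEL_NAMES.index(name)


# Abbreviated copy of the train_r2 example (premise omitted here).
example = {
    "uid": "ed5c37ab-77c5-4dbc-ba75-8fd617b19712",
    "hypothesis": "Idris Sultan was born in the first month of the year preceding 1994.",
    "label": 0,
    "reason": "",
}

print(label_name(example["label"]))  # entailment
```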

Data splits

name	train_r1	dev_r1	train_r2	dev_r2	train_r3	dev_r3	test_r1	test_r2	test_r3
plain_text	16946	1000	45460	1000	100459	1200	1000	1000	1200
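
To make the split sizes easier to compare, the short sketch below sums the example counts from the table by split kind and by round. The counts are copied from the table; the function names are hypothetical.

```python
# Example counts copied from the data splits table.
SPLIT_SIZES = {
    "train_r1": 16946, "dev_r1": 1000, "test_r1": 1000,
    "train_r2": 45460, "dev_r2": 1000, "test_r2": 1000,
    "train_r3": 100459, "dev_r3": 1200, "test_r3": 1200,
}


def total_for_kind(kind: str) -> int:
    """Sum examples over all rounds for one kind ('train', 'dev', 'test')."""
    return sum(n for name, n in SPLIT_SIZES.items() if name.startswith(kind + "_"))


def total_for_round(round_id: int) -> int:
    """Sum train, dev, and test examples for one round."""
    return sum(n for name, n in SPLIT_SIZES.items() if name.endswith(f"_r{round_id}"))


print(total_for_kind("train"))    # 162865
print(total_for_round(3))         # 102859
print(sum(SPLIT_SIZES.values()))  # 169265
```

The overall total of 169265 examples is consistent with the `100K<n<1M` size category declared in the card metadata.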

Dataset creation

Dataset info

Config name: plain_text
Features:
- uid: string
- premise: string
- hypothesis: string
- label: classification label; possible values are entailment (0), neutral (1), contradiction (2)
- reason: string
Splits:
- train_r1: 16946 examples
- dev_r1: 1000 examples
- test_r1: 1000 examples
- train_r2: 45460 examples
- dev_r2: 1000 examples
- test_r2: 1000 examples
- train_r3: 100459 examples
- dev_r3: 1200 examples
- test_r3: 1200 examples
Download size: 26286748 bytes
Dataset size: 77126972 bytes
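
The two sizes are given in raw bytes. A minimal conversion to decimal megabytes, assuming 1 MB = 1,000,000 bytes (the listing does not say which unit convention the page uses elsewhere), might look like this; `bytes_to_mb` is a hypothetical helper.

```python
# Byte counts copied from the dataset info listing.
DOWNLOAD_SIZE_BYTES = 26286748
DATASET_SIZE_BYTES = 77126972


def bytes_to_mb(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (assumes 1 MB = 1,000,000 bytes)."""
    return num_bytes / 1_000_000


print(round(bytes_to_mb(DOWNLOAD_SIZE_BYTES), 2))  # 26.29
print(round(bytes_to_mb(DATASET_SIZE_BYTES), 2))   # 77.13
```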

Configs

Config name: plain_text
Data files:
- train_r1: plain_text/train_r1-*
- dev_r1: plain_text/dev_r1-*
- test_r1: plain_text/test_r1-*
- train_r2: plain_text/train_r2-*
- dev_r2: plain_text/dev_r2-*
- test_r2: plain_text/test_r2-*
- train_r3: plain_text/train_r3-*
- dev_r3: plain_text/dev_r3-*
- test_r3: plain_text/test_r3-*
Default: true
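
Each split is tied to its files by a glob pattern such as `plain_text/train_r1-*`. The sketch below shows how such patterns could be matched against file paths using Python's standard `fnmatch` module. The shard file names in the usage lines are invented for illustration, since the listing gives only the patterns, and `split_for_file` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

SPLITS = [f"{kind}_r{r}" for r in (1, 2, 3) for kind in ("train", "dev", "test")]

# Patterns as given in the config listing: one glob per split.
DATA_FILES = {split: f"plain_text/{split}-*" for split in SPLITS}


def split_for_file(path: str):
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical file names, used only to exercise the patterns.
print(split_for_file("plain_text/train_r1-00000-of-00001.parquet"))  # train_r1
print(split_for_file("README.md"))                                   # None
```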

License information

License: cc-by-nc-4.0

Citation information

@InProceedings{nie2019adversarial,
    title={Adversarial NLI: A New Benchmark for Natural Language Understanding},
    author={Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics",
}

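Collected summary

Dataset introduction

Construction

The Adversarial NLI (ANLI) dataset was collected through an iterative, adversarial human-and-model-in-the-loop process. It consists of text pairs, each made up of a premise and a hypothesis, and is divided into three rounds, each with training, validation (dev), and test sets. The dataset was built to raise the difficulty of natural language understanding benchmarks so that they better separate models by their performance on natural language inference.

Characteristics

The defining characteristic of ANLI is that it is adversarial. It is more challenging than the earlier SNLI and MNLI datasets because it iteratively adds human-written adversarial examples that test how well a model handles complex language phenomena. The dataset has three rounds of increasing difficulty, so models must keep adapting to new challenges. It is monolingual (English), and every example carries one of three labels: entailment, neutral, or contradiction.

Usage

Researchers can use ANLI with a conventional data-processing workflow. First, download and unpack the dataset. Then select the training, validation, or test split for the round of interest. The data is stored in JSON format; each instance has a unique identifier, a premise, a hypothesis, a label, and an optional reason field. These fields can be used to train, evaluate, or test natural language inference models.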
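
The workflow described above can be sketched as a small parser that reads records and checks that each one has the five fields. It assumes one JSON object per line (JSON Lines), which is an assumption about the on-disk layout rather than something the page states; `parse_records` is a hypothetical helper and the sample premise is shortened.

```python
import json

REQUIRED_FIELDS = ("uid", "premise", "hypothesis", "label", "reason")


def parse_records(lines):
    """Parse JSON Lines text into a list of dicts, validating field names."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


# One sample line modeled on the train_r2 example (premise shortened).
sample = [
    '{"uid": "ed5c37ab-77c5-4dbc-ba75-8fd617b19712", '
    '"premise": "Idris Sultan (born January 1993) is a Tanzanian actor.", '
    '"hypothesis": "Idris Sultan was born in the first month of the year preceding 1994.", '
    '"label": 0, "reason": ""}',
]

records = parse_records(sample)
print(records[0]["hypothesis"])
```

To read from disk, an open file object can be passed directly, since iterating a text file yields its lines.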

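Background and challenges

Background

Adversarial Natural Language Inference (ANLI) is a large-scale benchmark dataset introduced by a Facebook research team and published at ACL 2020 to advance natural language understanding. It was collected through an iterative, adversarial human-and-model-in-the-loop process and is markedly harder than its predecessors SNLI and MNLI. The dataset has three rounds, each with training, validation, and test sets. ANLI was created to address open challenges in natural language inference; its authors include Yixin Nie, Adina Williams, and Emily Dinan, among others. It has become an important resource for research in natural language processing.

Current challenges

Challenges in constructing ANLI include: first, introducing adversarial examples iteratively in a way that improves model generalization; second, keeping the dataset diverse and balanced to avoid bias; third, controlling annotation quality so that labels are accurate and consistent. As for the research problem it targets, the challenge is to help models recognize and handle the complex and subtle language phenomena that arise in natural language inference, and so perform better in practical applications.

Common scenarios

Typical usage

In natural language processing, Adversarial NLI (ANLI) is widely used to evaluate the robustness of models on natural language inference. Because it was collected through an adversarial human-and-model-in-the-loop process, it is considerably harder than its predecessors SNLI and MNLI, which makes it an important benchmark for model performance. Typical uses include training and testing natural language inference models that decide whether a given premise and hypothesis stand in an entailment, neutral, or contradiction relation.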
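
Evaluating a model on this task usually comes down to accuracy over premise and hypothesis pairs. Below is a minimal sketch under stated assumptions: `predict` stands in for any model as a function from (premise, hypothesis) to a label id, the three toy examples are invented for illustration and are not drawn from ANLI, and the majority-class baseline is a generic reference point rather than a method from the ANLI paper.

```python
from collections import Counter


def accuracy(examples, predict) -> float:
    """Fraction of examples where predict(premise, hypothesis) equals the label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        1 for ex in examples if predict(ex["premise"], ex["hypothesis"]) == ex["label"]
    )
    return correct / len(examples)


def majority_label(examples) -> int:
    """Most frequent label id in a list of examples."""
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]


# Toy examples, invented for illustration (0 entailment, 2 contradiction).
toy_dev = [
    {"premise": "A cat sleeps.", "hypothesis": "An animal sleeps.", "label": 0},
    {"premise": "It is raining.", "hypothesis": "Water is falling.", "label": 0},
    {"premise": "The door is open.", "hypothesis": "The door is shut.", "label": 2},
]

baseline = majority_label(toy_dev)
score = accuracy(toy_dev, lambda premise, hypothesis: baseline)
print(baseline, round(score, 3))  # 0 0.667
```

Running the same function separately on the dev or test split of each round gives per-round accuracies.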

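Derived and related work

ANLI has given rise to a range of follow-up work, such as improving model architectures to better handle adversarial examples, and studying how different types of bias appear in the dataset and how they can be mitigated. This research has advanced the theory of natural language processing and has also provided more solutions for practical applications.

Recent research on the dataset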