google-research-datasets/gap

Name: google-research-datasets/gap
Creator: google-research-datasets
Published: 2024-01-18 11:04:03
License: not specified

Hugging Face, updated 2024-01-18, indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/google-research-datasets/gap


Resource overview:

---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
license:
- unknown
multilinguality:
- monolingual
pretty_name: GAP Benchmark Suite
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- coreference-resolution
paperswithcode_id: gap
dataset_info:
  features:
  - name: ID
    dtype: string
  - name: Text
    dtype: string
  - name: Pronoun
    dtype: string
  - name: Pronoun-offset
    dtype: int32
  - name: A
    dtype: string
  - name: A-offset
    dtype: int32
  - name: A-coref
    dtype: bool
  - name: B
    dtype: string
  - name: B-offset
    dtype: int32
  - name: B-coref
    dtype: bool
  - name: URL
    dtype: string
  splits:
  - name: train
    num_bytes: 1095623
    num_examples: 2000
  - name: validation
    num_bytes: 248329
    num_examples: 454
  - name: test
    num_bytes: 1090462
    num_examples: 2000
  download_size: 2401971
  dataset_size: 2434414
---

# Dataset Card for "gap"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/google-research-datasets/gap-coreference](https://github.com/google-research-datasets/gap-coreference)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns](https://arxiv.org/abs/1810.05201)
- **Point of Contact:** [gap-coreference@google.com](mailto:gap-coreference@google.com)
- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

### Dataset Summary

GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

An example of 'validation' looks as follows.

```
{
    "A": "aliquam ultrices sagittis",
    "A-coref": false,
    "A-offset": 208,
    "B": "elementum curabitur vitae",
    "B-coref": false,
    "B-offset": 435,
    "ID": "validation-1",
    "Pronoun": "condimentum mattis pellentesque",
    "Pronoun-offset": 948,
    "Text": "Lorem ipsum dolor",
    "URL": "sem fringilla ut"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `ID`: a `string` feature.
- `Text`: a `string` feature.
- `Pronoun`: a `string` feature.
- `Pronoun-offset`: an `int32` feature.
- `A`: a `string` feature.
- `A-offset`: an `int32` feature.
- `A-coref`: a `bool` feature.
- `B`: a `string` feature.
- `B-offset`: an `int32` feature.
- `B-coref`: a `bool` feature.
- `URL`: a `string` feature.

### Data Splits

| name  |train|validation|test|
|-------|----:|---------:|---:|
|default| 2000|       454|2000|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{webster-etal-2018-mind,
    title = "Mind the {GAP}: A Balanced Corpus of Gendered Ambiguous Pronouns",
    author = "Webster, Kellie and Recasens, Marta and Axelrod, Vera and Baldridge, Jason",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "6",
    year = "2018",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/Q18-1042",
    doi = "10.1162/tacl_a_00240",
    pages = "605--617",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@otakumesi](https://github.com/otakumesi), [@lewtun](https://github.com/lewtun) for adding this dataset.

Provider:

google-research-datasets

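Collected summary

Dataset introduction

Construction

The GAP dataset is built around gender balance: (ambiguous pronoun, antecedent name) pairs are sampled from Wikipedia and labeled for coreference, providing a benchmark for evaluating coreference resolution in practical applications. It contains 8,908 labeled pronoun-antecedent pairs, divided into training, validation, and test splits.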
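The 8,908 figure can be reconciled with the split sizes in the dataset card's "Data Splits" table. The sketch below assumes that each example contributes two (pronoun, name) pairs, one for candidate `A` and one for candidate `B`; that reading is inferred from the field list and is not stated on this page.

```python
# Split sizes as listed in the dataset card's "Data Splits" table.
SPLIT_SIZES = {"train": 2000, "validation": 454, "test": 2000}

# Assumption: every example yields two labeled (pronoun, name) pairs,
# one for candidate A and one for candidate B.
CANDIDATES_PER_EXAMPLE = 2

total_examples = sum(SPLIT_SIZES.values())
total_pairs = total_examples * CANDIDATES_PER_EXAMPLE

print(total_examples)  # 4454
print(total_pairs)     # 8908
```

Under that assumption the 4,454 examples account for exactly the 8,908 pairs quoted in the summary.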

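Features

GAP's distinguishing feature is its gender balance: it covers ambiguous pronouns of different genders, and its text is drawn from Wikipedia, which lends it diversity and practical relevance. Each instance contains the pronoun, the candidate antecedent names, the text snippet, their offsets, and the coreference labels, which makes the dataset suitable for evaluating coreference resolution.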
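A quick sanity check on a record is to confirm that each offset points at its mention inside `Text`. This is a minimal sketch that treats the offsets as zero-based character offsets; the record below is a hypothetical illustration written for this example (including its labels), not a row from the dataset, and `span_matches` is a helper name introduced here.

```python
def span_matches(text: str, mention: str, offset: int) -> bool:
    """Return True if `mention` occurs in `text` starting at character `offset`."""
    return text[offset:offset + len(mention)] == mention


# Hypothetical record using the field names from the dataset card.
record = {
    "ID": "example-1",
    "Text": "Alice met Beth after she returned from Paris.",
    "Pronoun": "she",
    "Pronoun-offset": 21,
    "A": "Alice",
    "A-offset": 0,
    "A-coref": True,   # illustrative label only
    "B": "Beth",
    "B-offset": 10,
    "B-coref": False,  # illustrative label only
}

# Check the pronoun and both candidate names against their offsets.
for name_key, offset_key in [("Pronoun", "Pronoun-offset"), ("A", "A-offset"), ("B", "B-offset")]:
    ok = span_matches(record["Text"], record[name_key], record[offset_key])
    print(name_key, ok)
```

A check like this is useful after any preprocessing that rewrites `Text` (for example whitespace normalization), since such steps silently invalidate the stored offsets.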

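Usage

GAP can be loaded directly with the Hugging Face `datasets` library. Each example is a flat record with the fields ID, text, pronoun, offsets, the candidate antecedents, and their coreference labels. Users can apply whatever pre- and post-processing they need before using the data for model training and evaluation.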
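Loading and a typical first preprocessing step might look like the sketch below. `load_dataset` is the standard entry point of the `datasets` library and the repository name is the one shown on this page; `load_gap` and `to_pairs` are hypothetical helper names introduced for illustration, and whether extra arguments are needed depends on the installed library version.

```python
from typing import Any, Dict, List, Tuple

DATASET_ID = "google-research-datasets/gap"  # repository name from this page


def load_gap():
    """Download GAP from the Hugging Face Hub (requires network access)."""
    # Imported lazily so to_pairs() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID)


def to_pairs(example: Dict[str, Any]) -> List[Tuple[str, str, bool]]:
    """Flatten one record into two (pronoun, candidate name, label) pairs."""
    return [
        (example["Pronoun"], example["A"], bool(example["A-coref"])),
        (example["Pronoun"], example["B"], bool(example["B-coref"])),
    ]


# Hypothetical record, used only to show the output shape of to_pairs().
sample = {"Pronoun": "she", "A": "Alice", "A-coref": True, "B": "Beth", "B-coref": False}
print(to_pairs(sample))
```

With network access, `gap = load_gap()` should return a dictionary-like object keyed by split name, so `to_pairs(gap["validation"][0])` would flatten the first validation record in the same way.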

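Background and challenges

Background overview

GAP (Gendered Ambiguous Pronouns) is a gender-balanced coreference resolution dataset released by the Google AI Language team in 2018 and introduced in the paper "Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns". Its authors are Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. The dataset targets gender bias in coreference resolution for practical applications and provides 8,908 labeled (ambiguous pronoun, antecedent name) pairs, all collected from Wikipedia. Its release has had a notable impact on natural language processing, and on coreference resolution in particular, by giving related research an important resource and benchmark.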

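Current challenges

The main challenges in constructing GAP were keeping the data gender-balanced so that no gender bias is introduced, and controlling annotation quality so that the labels are accurate and consistent. On the task side, the challenges GAP poses are how to make coreference resolution systems both gender-neutral and accurate when resolving ambiguous pronouns, and how to handle and measure gender bias effectively.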
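One way to put a number on gender bias, following the metric described in the GAP paper, is to compute F1 separately for feminine and masculine pronouns and report their ratio (feminine F1 divided by masculine F1). The sketch below is an illustration under stated assumptions, not the official scorer: gender is inferred from the pronoun string, the pronoun sets are an assumption about which forms occur, and the toy rows are invented.

```python
from typing import List, Tuple

# Assumption: gender is inferred from the pronoun's surface form.
FEMININE = {"she", "her", "hers"}
MASCULINE = {"he", "him", "his"}

Row = Tuple[str, bool, bool]  # (pronoun, gold label, predicted label)


def f1(rows: List[Row]) -> float:
    """F1 over (pronoun, candidate) pairs, treating 'coreferent' as positive."""
    tp = sum(1 for _, gold, pred in rows if gold and pred)
    fp = sum(1 for _, gold, pred in rows if pred and not gold)
    fn = sum(1 for _, gold, pred in rows if gold and not pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bias_ratio(rows: List[Row]) -> float:
    """Feminine F1 divided by masculine F1; 1.0 means equal performance."""
    feminine = [r for r in rows if r[0].lower() in FEMININE]
    masculine = [r for r in rows if r[0].lower() in MASCULINE]
    masculine_f1 = f1(masculine)
    if masculine_f1 == 0.0:
        raise ValueError("masculine F1 is zero; the ratio is undefined")
    return f1(feminine) / masculine_f1


# Invented predictions, for illustration only.
toy_rows: List[Row] = [
    ("she", True, True), ("her", False, False), ("she", True, False), ("her", False, True),
    ("he", True, True), ("his", True, True), ("him", False, False), ("he", False, True),
]
print(round(bias_ratio(toy_rows), 3))  # 0.625
```

A ratio below 1.0, as in the toy data, indicates that the system resolves feminine pronouns less well than masculine ones.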

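Common scenarios

Typical use cases

In natural language processing, the GAP Benchmark Suite is widely used to evaluate coreference resolution because of its gender-balanced labeled pairs. It provides 8,908 labeled (ambiguous pronoun, antecedent name) pairs drawn from Wikipedia and is intended to help models correctly identify and resolve the relationship between a pronoun and the name it refers to.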

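Academic problems addressed

GAP addresses the gender bias present in earlier coreference resolution datasets by giving researchers a balanced dataset that takes gender into account. With GAP, researchers can evaluate and improve coreference resolution systems more accurately, especially on gendered ambiguous pronouns, which matters for the overall performance of natural language understanding systems.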

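Derived and related work

The release of GAP has stimulated follow-up research. For example, researchers have used it to train and test new coreference resolution models and to assess the effect of gender bias in natural language processing, advancing both theory and practice in the field.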

The content above was collected and summarized by 遇见数据集.
