gap

Name: gap
Creator: maas
Published: 2025-12-05 16:41:04
License: No description provided

ModelScope community; updated 2025-12-05; indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/google-research-datasets/gap


Resource description:

# Dataset Card for "gap"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/google-research-datasets/gap-coreference](https://github.com/google-research-datasets/gap-coreference)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns](https://arxiv.org/abs/1810.05201)
- **Point of Contact:** [gap-coreference@google.com](mailto:gap-coreference@google.com)
- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

### Dataset Summary

GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.

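The 8,908 figure counts (pronoun, name) pairs rather than rows. As a quick sanity check, the sketch below assumes that every example contributes exactly two candidate pairs (one for name `A` and one for name `B`, as the data fields in this card suggest) and uses the split sizes from the Data Splits table of this card. The variable names are illustrative only.

```python
# Split sizes as listed in the Data Splits table of this card.
split_sizes = {"train": 2000, "validation": 454, "test": 2000}

# Assumption: each example carries two candidate names (A and B),
# so it yields two (pronoun, name) pairs.
CANDIDATES_PER_EXAMPLE = 2

total_examples = sum(split_sizes.values())
total_pairs = total_examples * CANDIDATES_PER_EXAMPLE

print(total_examples)  # 4454
print(total_pairs)     # 8908
```

Under that assumption the split sizes and the headline pair count agree.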
### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

An example from the 'validation' split looks as follows.

```
{
    "A": "aliquam ultrices sagittis",
    "A-coref": false,
    "A-offset": 208,
    "B": "elementum curabitur vitae",
    "B-coref": false,
    "B-offset": 435,
    "ID": "validation-1",
    "Pronoun": "condimentum mattis pellentesque",
    "Pronoun-offset": 948,
    "Text": "Lorem ipsum dolor",
    "URL": "sem fringilla ut"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `ID`: a `string` feature.
- `Text`: a `string` feature.
- `Pronoun`: a `string` feature.
- `Pronoun-offset`: an `int32` feature.
- `A`: a `string` feature.
- `A-offset`: an `int32` feature.
- `A-coref`: a `bool` feature.
- `B`: a `string` feature.
- `B-offset`: an `int32` feature.
- `B-coref`: a `bool` feature.
- `URL`: a `string` feature.

### Data Splits

| name    | train | validation | test |
|---------|------:|-----------:|-----:|
| default |  2000 |        454 | 2000 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{webster-etal-2018-mind,
    title = "Mind the {GAP}: A Balanced Corpus of Gendered Ambiguous Pronouns",
    author = "Webster, Kellie and Recasens, Marta and Axelrod, Vera and Baldridge, Jason",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "6",
    year = "2018",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/Q18-1042",
    doi = "10.1162/tacl_a_00240",
    pages = "605--617",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@otakumesi](https://github.com/otakumesi), [@lewtun](https://github.com/lewtun) for adding this dataset.
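
The fields listed under Data Fields can be exercised without downloading anything. The sketch below is a minimal illustration, assuming the `*-offset` fields are character offsets into `Text` (the card itself only gives their types) and that at most one of `A-coref` and `B-coref` is true. It uses a hand-written, hypothetical record, because the example instance in this card contains placeholder text. The helper names `span_matches` and `gold_label` are made up for this example and are not part of any library.

```python
# Hypothetical record, written by hand for illustration only.
text = "Alice told Beth that she had won the prize."
record = {
    "ID": "example-1",
    "Text": text,
    "Pronoun": "she",
    "Pronoun-offset": text.index("she"),
    "A": "Alice",
    "A-offset": text.index("Alice"),
    "A-coref": True,
    "B": "Beth",
    "B-offset": text.index("Beth"),
    "B-coref": False,
    "URL": "",
}


def span_matches(rec: dict, field: str) -> bool:
    """Check that rec[field] occurs in Text at the stated character offset."""
    start = rec[f"{field}-offset"]
    mention = rec[field]
    return rec["Text"][start:start + len(mention)] == mention


def gold_label(rec: dict) -> str:
    """Collapse the two boolean flags into one label: 'A', 'B', or 'NEITHER'."""
    if rec["A-coref"]:
        return "A"
    if rec["B-coref"]:
        return "B"
    return "NEITHER"


print(all(span_matches(record, f) for f in ("Pronoun", "A", "B")))  # True
print(gold_label(record))  # A
```

The `NEITHER` branch covers records where both flags are false, as in the placeholder instance shown in this card.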


Provider:

maas

Created:

2025-07-07
