coref-data/gap_raw

Name: coref-data/gap_raw
Creator: coref-data
Published: 2024-01-19 00:03:40
License: no description available

Source: Hugging Face | Updated: 2024-01-19 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/coref-data/gap_raw

Resource description:

---
license: apache-2.0
---

# Dataset Card for "gap"

## Dataset Description

- **Homepage:** [https://github.com/google-research-datasets/gap-coreference](https://github.com/google-research-datasets/gap-coreference)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns](https://arxiv.org/abs/1810.05201)
- **Point of Contact:** [gap-coreference@google.com](mailto:gap-coreference@google.com)
- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

### Dataset Summary

GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia and released by Google AI Language for the evaluation of coreference resolution in practical applications.

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 2.40 MB
- **Size of the generated dataset:** 2.43 MB
- **Total amount of disk used:** 4.83 MB

An example of 'validation' looks as follows.

```
{
    "A": "aliquam ultrices sagittis",
    "A-coref": false,
    "A-offset": 208,
    "B": "elementum curabitur vitae",
    "B-coref": false,
    "B-offset": 435,
    "ID": "validation-1",
    "Pronoun": "condimentum mattis pellentesque",
    "Pronoun-offset": 948,
    "Text": "Lorem ipsum dolor",
    "URL": "sem fringilla ut"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `ID`: a `string` feature.
- `Text`: a `string` feature.
- `Pronoun`: a `string` feature.
- `Pronoun-offset`: an `int32` feature.
- `A`: a `string` feature.
- `A-offset`: an `int32` feature.
- `A-coref`: a `bool` feature.
- `B`: a `string` feature.
- `B-offset`: an `int32` feature.
- `B-coref`: a `bool` feature.
- `URL`: a `string` feature.

### Data Splits

| name    | train | validation | test |
|---------|------:|-----------:|-----:|
| default |  2000 |        454 | 2000 |

### Citation Information

```
@article{webster-etal-2018-mind,
    title = "Mind the {GAP}: A Balanced Corpus of Gendered Ambiguous Pronouns",
    author = "Webster, Kellie and Recasens, Marta and Axelrod, Vera and Baldridge, Jason",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "6",
    year = "2018",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/Q18-1042",
    doi = "10.1162/tacl_a_00240",
    pages = "605--617",
}
```

### Contributions

Modified from dataset added by [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@otakumesi](https://github.com/otakumesi), [@lewtun](https://github.com/lewtun)
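
The offset fields can be sanity-checked against `Text`. The sketch below assumes that each `*-offset` value is a 0-based character offset into `Text`, which is how the upstream GAP repository describes them; this card does not state it. The record is a made-up illustration using the card's field names, not a row from the dataset, and `check_offsets` is a hypothetical helper.

```python
from typing import Dict, Union

Example = Dict[str, Union[str, int, bool]]

# Made-up record using the card's field names; NOT an actual dataset row.
example: Example = {
    "ID": "illustration-1",
    "Text": "Alice met Beth after she finished the race.",
    "Pronoun": "she",
    "Pronoun-offset": 21,
    "A": "Alice",
    "A-offset": 0,
    "A-coref": False,
    "B": "Beth",
    "B-offset": 10,
    "B-coref": True,
    "URL": "http://example.org/illustration",
}


def check_offsets(ex: Example) -> Dict[str, bool]:
    """For each mention field, report whether the slice of Text that starts
    at the stored offset equals the mention string.

    Assumption: offsets are 0-based character positions into Text.
    """
    text = str(ex["Text"])
    result: Dict[str, bool] = {}
    for field in ("Pronoun", "A", "B"):
        mention = str(ex[field])
        start = int(ex[f"{field}-offset"])
        result[field] = text[start:start + len(mention)] == mention
    return result


print(check_offsets(example))
```

A check like this is a cheap guard against tokenization or encoding steps that silently shift character positions before the offsets are used.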

Provider:

coref-data

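Collected Summary

Dataset Introduction

Construction

In natural language processing, coreference resolution aims to identify the entity that a pronoun refers to. The GAP corpus is sampled from Wikipedia, which gives the text variety and realism. For the ambiguous pronoun in each example, human annotators marked two candidate antecedents (A and B), recorded the position offsets of the mentions in the source text, and used boolean labels to indicate which candidate, if either, corefers with the pronoun. The result is a balanced corpus of 8,908 labeled pairs that serves as a benchmark for model evaluation.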
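
The 8,908 figure is consistent with the split sizes listed in the dataset card if each example contributes two (pronoun, name) pairs, one for candidate A and one for candidate B. A quick check, with the split sizes copied from the card:

```python
# Split sizes as listed in the dataset card.
split_sizes = {"train": 2000, "validation": 454, "test": 2000}

# Each example pairs one pronoun with two candidate names (A and B).
pairs_per_example = 2

total_examples = sum(split_sizes.values())
total_pairs = total_examples * pairs_per_example

print(total_examples, total_pairs)  # 4454 8908
```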

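Characteristics

The dataset's defining feature is gender balance: it was designed so that male and female names appear as antecedents in balanced proportions. This makes it possible to assess both the accuracy and the fairness of a model on gendered ambiguous pronouns and to expose algorithmic bias. Each instance has a clear structure: the source text, the pronoun, two candidate entities with their exact positions in the text, and ground-truth coreference labels. The data is divided into training, validation, and test sets of 2,000, 454, and 2,000 examples, respectively.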
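
One way to inspect the balance of a loaded split is to count its pronouns by gender. The sketch below rests on assumptions: the pronoun lists and the helper names `pronoun_gender` and `gender_counts` are not part of the dataset, and the rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed pronoun inventory; adjust if the data contains other forms.
MASCULINE = {"he", "him", "his"}
FEMININE = {"she", "her", "hers"}


def pronoun_gender(pronoun: str) -> str:
    """Map a pronoun string to 'masculine', 'feminine', or 'other'."""
    p = pronoun.strip().lower()
    if p in MASCULINE:
        return "masculine"
    if p in FEMININE:
        return "feminine"
    return "other"


def gender_counts(rows: Iterable[Dict[str, object]]) -> Counter:
    """Count rows per pronoun gender using the `Pronoun` field."""
    return Counter(pronoun_gender(str(row["Pronoun"])) for row in rows)


# Made-up rows; only the field used by the helper is filled in.
rows = [{"Pronoun": "She"}, {"Pronoun": "his"}, {"Pronoun": "her"}, {"Pronoun": "He"}]
print(gender_counts(rows))
```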

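Usage

Researchers typically use GAP as a benchmark for coreference resolution models. After loading the dataset, fields such as `Text`, `Pronoun`, `A`, and `B` are available, and the `A-coref` and `B-coref` labels serve as the supervision signal for training or evaluation. Given the context, the model must decide whether the pronoun corefers with candidate A, with candidate B, or with neither. The dataset is hosted on the Hugging Face platform and can be loaded with the `datasets` library; its standardized format makes it straightforward to plug into existing natural language processing pipelines.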
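
A minimal sketch of this workflow, assuming the Hugging Face `datasets` package is installed and that the repository id `coref-data/gap_raw` from the download link exposes the splits listed in the card. `load_gap`, `gold_label`, and `accuracy` are hypothetical helper names, and the rows in the demonstration are made up so the block runs without network access.

```python
from typing import Dict, Iterable, List


def load_gap(split: str = "validation"):
    """Load one split from the Hub (needs network access and `pip install datasets`).

    Assumption: the repository has a single default configuration.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline
    return load_dataset("coref-data/gap_raw", split=split)


def gold_label(row: Dict[str, object]) -> str:
    """Collapse the two boolean labels into "A", "B", or "NEITHER".

    Assumption: at most one of A-coref and B-coref is true.
    """
    if row["A-coref"]:
        return "A"
    if row["B-coref"]:
        return "B"
    return "NEITHER"


def accuracy(rows: Iterable[Dict[str, object]], predictions: List[str]) -> float:
    """Fraction of rows whose three-way gold label matches the prediction."""
    gold = [gold_label(r) for r in rows]
    if len(gold) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold) if gold else 0.0


# Made-up rows; a trivial baseline that always answers "A".
demo_rows = [
    {"A-coref": True, "B-coref": False},
    {"A-coref": False, "B-coref": True},
    {"A-coref": False, "B-coref": False},
    {"A-coref": True, "B-coref": False},
]
print(accuracy(demo_rows, ["A"] * len(demo_rows)))  # 0.5
```

With network access, `accuracy(load_gap("validation"), predictions)` would apply the same scoring to real rows, provided `predictions` holds one label per row.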

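Background and Challenges

Background Overview

Coreference resolution, which determines the entity a pronoun refers to, is one of the core tasks in understanding how parts of a text relate semantically. In 2018, the Google AI Language team released the GAP dataset, built by Kellie Webster, Marta Recasens, and colleagues, to address gender-balanced resolution of pronouns. The dataset is sampled from Wikipedia and contains 8,908 labeled (ambiguous pronoun, antecedent name) pairs. It was designed to evaluate coreference resolution models in practical applications, with particular attention to the effect of gender bias. GAP gave coreference research a gender-balanced benchmark and helped advance fairness evaluation in natural language processing.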

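Current Challenges

The domain challenge GAP addresses is identifying and mitigating gender bias in coreference resolution: when handling ambiguous pronouns, conventional models are often affected by skew in their training data, which leads to unequal error rates for entities of different genders. Challenges in constructing the dataset include extracting gender-balanced samples from real text, ensuring that annotations are accurate and consistent, and designing evaluation metrics that quantify a model's gender fairness. These challenges call for careful design of the sampling, annotation, and validation stages so that the dataset can support reliable analysis.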
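
One way to quantify such a disparity is to score masculine and feminine examples separately and compare the results. The sketch below computes F1 over the per-candidate boolean decisions and reports the feminine-to-masculine ratio, in the spirit of the bias measure described in the GAP paper. It is not the official scorer: the pronoun lists are an assumption, `f1` and `gendered_f1` are hypothetical helpers, and the rows and predictions are made up.

```python
from typing import Dict, List, Tuple

# Assumed pronoun inventory.
FEMININE = {"she", "her", "hers"}
MASCULINE = {"he", "him", "his"}

Row = Dict[str, object]
Pred = Tuple[bool, bool]  # predicted (A-coref, B-coref)


def f1(rows: List[Row], preds: List[Pred]) -> float:
    """F1 over candidate-level decisions: each row contributes two decisions."""
    tp = fp = fn = 0
    for row, (pred_a, pred_b) in zip(rows, preds):
        for gold, pred in ((bool(row["A-coref"]), pred_a), (bool(row["B-coref"]), pred_b)):
            tp += int(gold and pred)
            fp += int(pred and not gold)
            fn += int(gold and not pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def gendered_f1(rows: List[Row], preds: List[Pred]) -> Dict[str, float]:
    """Per-gender F1 plus the feminine/masculine ratio ("bias")."""
    groups: Dict[str, Tuple[List[Row], List[Pred]]] = {
        "feminine": ([], []),
        "masculine": ([], []),
    }
    for row, pred in zip(rows, preds):
        p = str(row["Pronoun"]).lower()
        key = "feminine" if p in FEMININE else "masculine" if p in MASCULINE else None
        if key is not None:
            groups[key][0].append(row)
            groups[key][1].append(pred)
    scores = {k: f1(r, p) for k, (r, p) in groups.items()}
    masc = scores["masculine"]
    scores["bias"] = scores["feminine"] / masc if masc else float("nan")
    return scores


# Made-up rows and predictions.
rows = [
    {"Pronoun": "she", "A-coref": True, "B-coref": False},
    {"Pronoun": "her", "A-coref": False, "B-coref": True},
    {"Pronoun": "he", "A-coref": True, "B-coref": False},
    {"Pronoun": "his", "A-coref": False, "B-coref": True},
]
preds = [(True, False), (True, False), (True, False), (False, True)]
print(gendered_f1(rows, preds))
```

A ratio near 1 indicates similar performance on both groups; a value below 1 indicates weaker performance on feminine examples.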

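Common Scenarios

Classic Use Cases

Coreference resolution is one of the core tasks in understanding how parts of a text relate semantically. With its gender-balanced design, GAP is a standard benchmark for evaluating coreference resolution models on real-world text. Its labeled examples, sampled from Wikipedia, pair an ambiguous pronoun with candidate antecedents, so researchers can systematically test how accurately and robustly a model identifies what a pronoun refers to, with particular attention to how gender bias affects its decisions.

Practical Applications

In practice, GAP is used to evaluate and improve the language-understanding components of systems such as intelligent assistants, machine translation, and text summarization. In dialogue systems, for example, resolving pronouns correctly makes interactions more coherent and natural; in information extraction, it helps link entities to events more precisely, which improves the quality of knowledge graph construction. These improvements raise the practical value of AI systems in settings such as education, customer service, and content generation.

Derived and Related Work

A number of research efforts build on GAP, including a benchmark evaluation framework developed by the Google team and later extensions to cross-lingual coreference resolution. For example, researchers have used the dataset to train Transformer-based end-to-end models, such as variants of BERT and SpanBERT, which achieved notable gains on coreference resolution. It has also prompted adversarial evaluation methods targeting gender balance, contributing to theory and tooling in fair machine learning.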

The content above was collected and summarized by 遇见数据集.
