gsarti/mt_geneval

Name: gsarti/mt_geneval
Creator: gsarti
Published: 2022-11-21 14:52:09
License: not specified in this listing

Hugging Face · Updated 2022-11-21 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/gsarti/mt_geneval

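Overview:

The MT-GenEval benchmark evaluates gender translation accuracy from English into Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish. The dataset contains individual sentences annotated with gendered target words, as well as contrastive original-flipped translations with additional context. It comes in two configuration types, `sentences` and `context`, which contain individual sentences and sentences with context, respectively. The dataset size falls between 1K and 10K examples, and the data is expert-generated.

Provider:

gsarti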

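Summary of the original information

Dataset Card for MT-GenEval

Dataset Description

Dataset Summary

The MT-GenEval benchmark evaluates gender translation accuracy from English into {Arabic, French, German, Hindi, Italian, Portuguese, Russian, Spanish}. The dataset contains individual sentences with annotations on the gendered target words, and contrastive original-flipped translations with additional preceding context.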

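Supported Tasks and Leaderboards

Machine Translation

For more details on gender accuracy evaluation with MT-GenEval, refer to the original paper.

Languages

The dataset contains source English sentences extracted from Wikipedia and translated into the following languages: Arabic (ar), French (fr), German (de), Hindi (hi), Italian (it), Portuguese (pt), Russian (ru), and Spanish (es).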

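Dataset Structure

Data Instances

The dataset contains two configuration types, `sentences` and `context`, mirroring the structure of the original repository. The source and target languages are specified in the configuration name (e.g. `sentences_en_ar`, `context_en_it`). The `sentences` configurations contain masculine and feminine versions of individual sentences, with annotations on the gendered words. Here is an example entry from the `sentences_en_it` split (all `sentences_en_XX` splits have the same structure):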

```json
{
  "orig_id": 0,
  "source_feminine": "Pagratidis quickly recanted her confession, claiming she was psychologically pressured and beaten, and until the moment of her execution, she remained firm in her innocence.",
  "reference_feminine": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era stata picchiata, e fino al momento della sua esecuzione, rimase ferma sulla sua innocenza.",
  "source_masculine": "Pagratidis quickly recanted his confession, claiming he was psychologically pressured and beaten, and until the moment of his execution, he remained firm in his innocence.",
  "reference_masculine": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era stato picchiato, e fino al momento della sua esecuzione, rimase fermo sulla sua innocenza.",
  "source_feminine_annotated": "Pagratidis quickly recanted <F>her</F> confession, claiming <F>she</F> was psychologically pressured and beaten, and until the moment of <F>her</F> execution, <F>she</F> remained firm in <F>her</F> innocence.",
  "reference_feminine_annotated": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era <F>stata picchiata</F>, e fino al momento della sua esecuzione, rimase <F>ferma</F> sulla sua innocenza.",
  "source_masculine_annotated": "Pagratidis quickly recanted <M>his</M> confession, claiming <M>he</M> was psychologically pressured and beaten, and until the moment of <M>his</M> execution, <M>he</M> remained firm in <M>his</M> innocence.",
  "reference_masculine_annotated": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era <M>stato picchiato</M>, e fino al momento della sua esecuzione, rimase <M>fermo</M> sulla sua innocenza.",
  "source_feminine_keywords": "her;she;her;she;her",
  "reference_feminine_keywords": "stata picchiata;ferma",
  "source_masculine_keywords": "his;he;his;he;his",
  "reference_masculine_keywords": "stato picchiato;fermo"
}
```
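The relationship between the `*_annotated` and `*_keywords` fields can be illustrated with a short sketch. It assumes, based only on the sample entry above, that gendered spans are wrapped in `<F>…</F>` or `<M>…</M>` tags and that the keywords field joins those spans with semicolons. The helper names `extract_tagged_spans` and `strip_tags` are hypothetical and not part of the dataset or any library.

```python
import re
from typing import List

# Matches spans such as <F>her</F> or <M>stato picchiato</M>.
# Assumption: tags are never nested and always properly closed.
TAG_PATTERN = re.compile(r"<(F|M)>(.*?)</\1>")


def extract_tagged_spans(annotated: str) -> List[str]:
    """Return the text of every tagged span, in order of appearance."""
    return [match.group(2) for match in TAG_PATTERN.finditer(annotated)]


def strip_tags(annotated: str) -> str:
    """Remove the <F>/<M> markers and keep the sentence text."""
    return re.sub(r"</?[FM]>", "", annotated)


# Fields copied from the sample sentences_en_it entry.
entry = {
    "source_feminine": (
        "Pagratidis quickly recanted her confession, claiming she was "
        "psychologically pressured and beaten, and until the moment of her "
        "execution, she remained firm in her innocence."
    ),
    "source_feminine_annotated": (
        "Pagratidis quickly recanted <F>her</F> confession, claiming "
        "<F>she</F> was psychologically pressured and beaten, and until the "
        "moment of <F>her</F> execution, <F>she</F> remained firm in "
        "<F>her</F> innocence."
    ),
    "source_feminine_keywords": "her;she;her;she;her",
}

spans = extract_tagged_spans(entry["source_feminine_annotated"])
keywords = entry["source_feminine_keywords"].split(";")
plain = strip_tags(entry["source_feminine_annotated"])

print(spans == keywords)                 # spans agree with the keywords field
print(plain == entry["source_feminine"])  # untagged text equals the plain field
```

For this entry both comparisons hold; whether they hold for every row of the dataset is not stated in the card and would need to be checked.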

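The `context` configurations contain different English sources related to stereotypical professional roles, with additional preceding context and contrastive original-flipped translations. Here is an example entry from the `context_en_it` split (all `context_en_XX` splits have the same structure):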

```json
{
  "orig_id": 0,
  "context": "Pierpont told of entering and holding up the bank and then fleeing to Fort Wayne, where the loot was divided between him and three others.",
  "source": "However, Pierpont stated that Skeer was the planner of the robbery.",
  "reference_original": "Comunque, Pierpont disse che Skeer era il pianificatore della rapina.",
  "reference_flipped": "Comunque, Pierpont disse che Skeer era la pianificatrice della rapina."
}
```
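To see what distinguishes the two contrastive references, one can compare their word sets. The following is a minimal sketch, not the evaluation procedure of the original paper: the helper `unique_words` is hypothetical, and its whitespace tokenization with punctuation stripping is a simplifying assumption.

```python
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens with surrounding punctuation removed."""
    tokens = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return {token for token in tokens if token}


def unique_words(reference: str, contrast: str) -> Set[str]:
    """Words that appear in `reference` but not in `contrast`."""
    return tokenize(reference) - tokenize(contrast)


# Fields copied from the sample context_en_it entry.
reference_original = "Comunque, Pierpont disse che Skeer era il pianificatore della rapina."
reference_flipped = "Comunque, Pierpont disse che Skeer era la pianificatrice della rapina."

only_original = unique_words(reference_original, reference_flipped)
only_flipped = unique_words(reference_flipped, reference_original)

print(sorted(only_original))  # ['il', 'pianificatore']
print(sorted(only_flipped))   # ['la', 'pianificatrice']
```

In this entry the references differ only in the article and the gendered noun, which is the contrast the `context` configurations are meant to expose.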

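Data Splits

All `sentences_en_XX` configurations have 1200 examples in the train split and 300 in the test split. For the `context_en_XX` configurations, the number of examples depends on the language pair: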

| Configuration | # Train | # Test |
|---|---|---|
| `context_en_ar` | 792 | 1100 |
| `context_en_fr` | 477 | 1099 |
| `context_en_de` | 598 | 1100 |
| `context_en_hi` | 397 | 1098 |
| `context_en_it` | 465 | 1904 |
| `context_en_pt` | 574 | 1089 |
| `context_en_ru` | 583 | 1100 |
| `context_en_es` | 534 | 1096 |
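The configuration naming scheme and the split sizes listed here can be captured in a small lookup, for instance to sanity-check a downloaded copy. This is a sketch under the assumption that the figures in this card are accurate; `config_name` and `expected_split_sizes` are hypothetical helpers, not part of any library. With the Hugging Face `datasets` library, a name built this way would typically be passed as the configuration name when loading the dataset.

```python
from typing import Dict

# Target language codes listed in the card; the source is always English.
TARGET_LANGUAGES = ("ar", "fr", "de", "hi", "it", "pt", "ru", "es")

# Split sizes as stated in this card (assumed accurate, not re-verified).
SENTENCES_SPLITS = {"train": 1200, "test": 300}
CONTEXT_SPLITS = {
    "ar": {"train": 792, "test": 1100},
    "fr": {"train": 477, "test": 1099},
    "de": {"train": 598, "test": 1100},
    "hi": {"train": 397, "test": 1098},
    "it": {"train": 465, "test": 1904},
    "pt": {"train": 574, "test": 1089},
    "ru": {"train": 583, "test": 1100},
    "es": {"train": 534, "test": 1096},
}


def config_name(config_type: str, target_lang: str) -> str:
    """Build a name such as 'sentences_en_ar' or 'context_en_it'."""
    if config_type not in ("sentences", "context"):
        raise ValueError(f"unknown configuration type: {config_type!r}")
    if target_lang not in TARGET_LANGUAGES:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return f"{config_type}_en_{target_lang}"


def expected_split_sizes(config_type: str, target_lang: str) -> Dict[str, int]:
    """Return the train/test sizes the card lists for a configuration."""
    config_name(config_type, target_lang)  # validates both arguments
    if config_type == "sentences":
        return dict(SENTENCES_SPLITS)
    return dict(CONTEXT_SPLITS[target_lang])


all_configs = [
    config_name(kind, lang)
    for kind in ("sentences", "context")
    for lang in TARGET_LANGUAGES
]
print(len(all_configs))                     # 16
print(expected_split_sizes("context", "hi"))  # {'train': 397, 'test': 1098}
```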

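Dataset Creation

From the original paper:

In developing MT-GenEval, our goal was to create a realistic, gender-balanced dataset that naturally incorporates a diverse range of gender phenomena. To this end, we extracted English source sentences from Wikipedia as the basis for our dataset. We automatically pre-selected relevant sentences using EN gender-referring words based on the list provided by Zhao et al. (2018).

For more information on dataset creation, refer to the original article, MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation.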

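Additional Information

Dataset Curators

The original authors of MT-GenEval are the curators of the dataset. For problems or updates concerning this 🤗 Datasets version, please contact gabriele.sarti996@gmail.com.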

Licensing Information

The dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 International License.

Citation Information

Please cite the authors if you use these corpora in your work.

```bibtex
@inproceedings{currey-etal-2022-mtgeneval,
    title = "{MT-GenEval}: {A} Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation",
    author = "Currey, Anna and Nadejde, Maria and Pappagari, Raghavendra and Mayer, Mia and Lauly, Stanislas and Niu, Xing and Hsu, Benjamin and Dinu, Georgiana",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2211.01355",
}
```
