gsarti/mt_geneval

Source: Hugging Face. Updated 2022-11-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/gsarti/mt_geneval
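Summary:
The MT-GenEval benchmark evaluates gender translation accuracy from English into Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish. The dataset contains individual sentences with annotations on the gendered target words, and contrastive original-inverted translations with additional context. It comes in two configuration types, `sentences` and `context`, which hold single sentences and sentences with context, respectively. The dataset falls in the 1K to 10K size category and is expert-generated.
Provided by:
gsarti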
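Summary of original information

Dataset Card for MT-GenEval

Dataset Description

Dataset Summary

The MT-GenEval benchmark evaluates gender translation accuracy from English into {Arabic, French, German, Hindi, Italian, Portuguese, Russian, Spanish}. The dataset contains individual sentences with annotations on the gendered target words, and contrastive original-inverted translations with additional preceding context.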

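Supported Tasks and Leaderboards

Machine Translation

Refer to the original paper for more details on gender accuracy evaluation with MT-GenEval.

Languages

The dataset contains source English sentences extracted from Wikipedia and translated into the following languages: Arabic (ar), French (fr), German (de), Hindi (hi), Italian (it), Portuguese (pt), Russian (ru), and Spanish (es).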

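Dataset Structure

Data Instances

The dataset contains two configuration types, `sentences` and `context`, mirroring the structure of the original repository. The source and target languages are specified in the configuration name (e.g. `sentences_en_ar`, `context_en_it`). The `sentences` configurations contain masculine and feminine versions of individual sentences with gendered word annotations. Below is an example entry from the `sentences_en_it` configuration (all `sentences_en_XX` configurations share the same structure):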

```json
{
  "orig_id": 0,
  "source_feminine": "Pagratidis quickly recanted her confession, claiming she was psychologically pressured and beaten, and until the moment of her execution, she remained firm in her innocence.",
  "reference_feminine": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era stata picchiata, e fino al momento della sua esecuzione, rimase ferma sulla sua innocenza.",
  "source_masculine": "Pagratidis quickly recanted his confession, claiming he was psychologically pressured and beaten, and until the moment of his execution, he remained firm in his innocence.",
  "reference_masculine": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era stato picchiato, e fino al momento della sua esecuzione, rimase fermo sulla sua innocenza.",
  "source_feminine_annotated": "Pagratidis quickly recanted <F>her</F> confession, claiming <F>she</F> was psychologically pressured and beaten, and until the moment of <F>her</F> execution, <F>she</F> remained firm in <F>her</F> innocence.",
  "reference_feminine_annotated": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era <F>stata picchiata</F>, e fino al momento della sua esecuzione, rimase <F>ferma</F> sulla sua innocenza.",
  "source_masculine_annotated": "Pagratidis quickly recanted <M>his</M> confession, claiming <M>he</M> was psychologically pressured and beaten, and until the moment of <M>his</M> execution, <M>he</M> remained firm in <M>his</M> innocence.",
  "reference_masculine_annotated": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era <M>stato picchiato</M>, e fino al momento della sua esecuzione, rimase <M>fermo</M> sulla sua innocenza.",
  "source_feminine_keywords": "her;she;her;she;her",
  "reference_feminine_keywords": "stata picchiata;ferma",
  "source_masculine_keywords": "his;he;his;he;his",
  "reference_masculine_keywords": "stato picchiato;fermo"
}
```
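The relationship between the `*_annotated` and `*_keywords` fields in the entry above can be checked with a few lines of Python. This is a minimal sketch, assuming the `<F>`/`<M>` tag format and the semicolon-separated keyword format seen in this single example hold for the whole dataset; the helper names `extract_keywords` and `strip_tags` are illustrative and not part of any library.

```python
import re

# Matches <F>...</F> or <M>...</M> spans. The tag format is inferred from the
# example entry; it is an assumption that all entries follow it.
TAG_PATTERN = re.compile(r"<([FM])>(.*?)</\1>")


def extract_keywords(annotated: str) -> list[str]:
    """Return the tagged spans of an annotated sentence, in order of appearance."""
    return [match.group(2) for match in TAG_PATTERN.finditer(annotated)]


def strip_tags(annotated: str) -> str:
    """Remove the <F>/<M> markers, leaving the plain sentence."""
    return TAG_PATTERN.sub(r"\2", annotated)


# Fields copied from the sentences_en_it example entry.
entry = {
    "reference_feminine": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era stata picchiata, e fino al momento della sua esecuzione, rimase ferma sulla sua innocenza.",
    "reference_feminine_annotated": "Pagratidis subito ritrattò la sua confessione, affermando che era aveva subito pressioni psicologiche e era <F>stata picchiata</F>, e fino al momento della sua esecuzione, rimase <F>ferma</F> sulla sua innocenza.",
    "reference_feminine_keywords": "stata picchiata;ferma",
}

keywords = extract_keywords(entry["reference_feminine_annotated"])
print(keywords)  # ['stata picchiata', 'ferma']

# The keywords field is the tagged spans joined with ";".
print(";".join(keywords) == entry["reference_feminine_keywords"])  # True

# Removing the tags recovers the unannotated reference.
print(strip_tags(entry["reference_feminine_annotated"]) == entry["reference_feminine"])  # True
```

Note that a keyword can span several words (`stata picchiata`), so splitting the keywords field on whitespace would give a different result than splitting on `;`.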

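The `context` configurations instead contain different English sources related to stereotypical professional roles, with additional preceding context and contrastive original-inverted translations. Below is an example entry from the `context_en_it` configuration (all `context_en_XX` configurations share the same structure):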

```json
{
  "orig_id": 0,
  "context": "Pierpont told of entering and holding up the bank and then fleeing to Fort Wayne, where the loot was divided between him and three others.",
  "source": "However, Pierpont stated that Skeer was the planner of the robbery.",
  "reference_original": "Comunque, Pierpont disse che Skeer era il pianificatore della rapina.",
  "reference_flipped": "Comunque, Pierpont disse che Skeer era la pianificatrice della rapina."
}
```
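The contrastive pair `reference_original` / `reference_flipped` lends itself to a simple word-level check: find the words that occur in only one of the two references, then see whether a system translation contains words that belong only to the wrong-gender reference. The sketch below illustrates that idea under stated assumptions; it is not claimed to be the metric defined in the original paper, the tokenizer is deliberately naive, and the `hypothesis` string is a made-up example rather than real system output.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and extract runs of word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def unique_words(reference: str, contrast: str) -> set[str]:
    """Words that appear in `reference` but not in `contrast`."""
    return set(tokenize(reference)) - set(tokenize(contrast))


def uses_wrong_gender(hypothesis: str, correct_ref: str, wrong_ref: str) -> bool:
    """True if the hypothesis contains a word found only in the wrong-gender reference."""
    return bool(set(tokenize(hypothesis)) & unique_words(wrong_ref, correct_ref))


# Fields copied from the context_en_it example entry.
entry = {
    "reference_original": "Comunque, Pierpont disse che Skeer era il pianificatore della rapina.",
    "reference_flipped": "Comunque, Pierpont disse che Skeer era la pianificatrice della rapina.",
}

print(sorted(unique_words(entry["reference_original"], entry["reference_flipped"])))
# ['il', 'pianificatore']
print(sorted(unique_words(entry["reference_flipped"], entry["reference_original"])))
# ['la', 'pianificatrice']

# Hypothetical system output, invented for illustration only.
hypothesis = "Tuttavia, Pierpont ha affermato che Skeer era il pianificatore della rapina."
print(uses_wrong_gender(hypothesis, entry["reference_original"], entry["reference_flipped"]))
# False
```

A caveat of this word-level approach is that function words such as `la` can appear in a hypothesis for unrelated reasons, so a practical evaluation would likely need to treat them more carefully.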

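Data Splits

All `sentences_en_XX` configurations have 1200 examples in the `train` split and 300 in the `test` split. For the `context_en_XX` configurations, the number of examples depends on the language pair: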

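| Configuration   | # Train | # Test |
|-----------------|---------|--------|
| `context_en_ar` | 792     | 1100   |
| `context_en_fr` | 477     | 1099   |
| `context_en_de` | 598     | 1100   |
| `context_en_hi` | 397     | 1098   |
| `context_en_it` | 465     | 1904   |
| `context_en_pt` | 574     | 1089   |
| `context_en_ru` | 583     | 1100   |
| `context_en_es` | 534     | 1096   |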
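The configuration naming scheme and split sizes can be captured in a small helper, which is handy for iterating over all language pairs or sanity-checking a download. This is a sketch under assumptions: `config_name`, `all_config_names`, and `load_config` are illustrative names, the sizes are copied from this card rather than verified against the hosted data, and `load_config` relies on the Hugging Face `datasets` package and network access (depending on the installed `datasets` version, extra arguments may be required).

```python
# Target language codes listed in the card.
TARGET_LANGS = ["ar", "fr", "de", "hi", "it", "pt", "ru", "es"]

# (train, test) sizes copied from the card; not independently verified.
SENTENCES_SPLIT_SIZES = (1200, 300)
CONTEXT_SPLIT_SIZES = {
    "context_en_ar": (792, 1100),
    "context_en_fr": (477, 1099),
    "context_en_de": (598, 1100),
    "context_en_hi": (397, 1098),
    "context_en_it": (465, 1904),
    "context_en_pt": (574, 1089),
    "context_en_ru": (583, 1100),
    "context_en_es": (534, 1096),
}


def config_name(kind: str, target_lang: str) -> str:
    """Build a configuration name such as 'sentences_en_ar' or 'context_en_it'."""
    if kind not in ("sentences", "context"):
        raise ValueError(f"unknown configuration type: {kind!r}")
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return f"{kind}_en_{target_lang}"


def all_config_names() -> list[str]:
    """List every configuration name: both types for each target language."""
    return [config_name(kind, lang) for kind in ("sentences", "context") for lang in TARGET_LANGS]


def load_config(kind: str, target_lang: str):
    """Load one configuration from the Hugging Face Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset("gsarti/mt_geneval", config_name(kind, target_lang))


print(config_name("context", "it"))  # context_en_it
print(len(all_config_names()))       # 16
```

For example, `load_config("sentences", "it")` would be expected to return a dataset with `train` and `test` splits whose sizes can be compared against `SENTENCES_SPLIT_SIZES`.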

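Dataset Creation

From the original paper:

"In developing MT-GenEval, our goal was to create a realistic, gender-balanced dataset that naturally incorporates a diverse range of gender phenomena. To this end, we extracted English source sentences from Wikipedia as the basis for our dataset. We automatically pre-selected relevant sentences using EN gender-referring words based on the list provided by Zhao et al. (2018)."

For more information on dataset creation, refer to the original article, MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation.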

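Additional Information

Dataset Curators

The original authors of MT-GenEval are the curators of the dataset. For problems or updates concerning this 🤗 Datasets version, please contact gabriele.sarti996@gmail.com.

Licensing Information

The dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 International License.

Citation Information

Please cite the authors if you use these corpora in your work.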

```bibtex
@inproceedings{currey-etal-2022-mtgeneval,
    title = "{MT-GenEval}: {A} Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation",
    author = "Currey, Anna and
      Nadejde, Maria and
      Pappagari, Raghavendra and
      Mayer, Mia and
      Lauly, Stanislas and
      Niu, Xing and
      Hsu, Benjamin and
      Dinu, Georgiana",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2211.01355",
}
```
