gender-bias-PE

Name: gender-bias-PE
Creator: maas
Published: 2025-12-05 11:41:52
License: no description available

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/FBK-MT/gender-bias-PE


Resource description:

# Dataset Card for gender-bias-PE data

## Dataset Description

The gender-bias-PE dataset contains the post-edits and associated behavioural data from the human-centered experiments presented in the paper [*What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study*](https://arxiv.org/abs/2410.00545), accepted at EMNLP 2024. The dataset makes it possible to study the impact of gender bias in Machine Translation (MT) through human-centered measures such as post-editing effort (both temporal and technical).

The English source sentences in the dataset were automatically translated with Google Translate into Italian, Spanish, and German. 88 human participants were tasked with the *light* post-editing of the MT output, primarily to ensure correct gender translation in the target language. The data represent multilanguage, multiuser, and multidataset conditions. See the associated paper for full details.

## Dataset Structure

### Data Config

- `all`: load all 3 language pairs in a single dataset instance
- `en-it`: load the en-it portion of the dataset
- `en-es`: load the en-es portion of the dataset
- `en-de`: load the en-de portion of the dataset

### Source Data

The dataset is based on a subset of parallel en-* sentences extracted from the existing [MT-GenEval](https://aclanthology.org/2022.emnlp-main.288.pdf) corpus. We adapted several target reference translations from the original corpus (see Appendix B.1.2 in [Savoldi et al. (2024)](https://arxiv.org/abs/2410.00545)).

⚠️ The release of the [MuST-SHE](https://aclanthology.org/2020.acl-main.619/) sentences and post-edits is temporarily suspended pending clarification of the new [policy](https://www.ted.com/about/our-organization/our-policies-terms/ted-com-terms-of-use) adopted by TED for the use of its proprietary data.
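The configuration names listed under Data Config can be mapped to a loading call roughly as follows. This is a minimal sketch, not the authors' loader: the helper names `config_for` and `load_portion` are hypothetical, and it assumes the dataset is reachable through the Hugging Face `datasets` library under the repository id `FBK-MT/gender-bias-PE` (the id shown in the download link above).

```python
# Configuration names as listed in the dataset card.
CONFIGS = ("all", "en-it", "en-es", "en-de")


def config_for(target_lang=None):
    """Return the config name for a target language code ('it', 'es', 'de').

    With no language given, return 'all' (every language pair at once).
    """
    if target_lang is None:
        return "all"
    name = f"en-{target_lang}"
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    return name


def load_portion(target_lang=None, repo_id="FBK-MT/gender-bias-PE"):
    """Load one portion of the dataset.

    Assumption: `repo_id` is a guess based on the download link and is not
    confirmed by the card. The import is kept local so that `config_for`
    works even when the `datasets` package is not installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, config_for(target_lang))
```

For example, `load_portion("it")` would request the `en-it` configuration, while `load_portion()` would request `all`.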
#### Data Fields

- `lang`: target language
  - it
  - es
  - de
- `dataset`: original dataset
  - mtgen_un: subset of [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval/tree/main/data/context) with unambiguous gender in the source
  - mtgen_a: subset of [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval/tree/main/data/sentences/test) with ambiguous gender in the source
- `user_type`: type of user who carried out the post-edit
  - professional: experienced translator
  - student: inexperienced user
- `original_id`: unique sentence identifier from the original dataset
- `gender`: gender expressed in the original reference translation and in the post-edited sentence
  - F: feminine
  - M: masculine
- `segment`: source English sentence
- `tgt`: target reference translation
- `raw_word_count`: number of source words
- `time_to_edit`: time to edit in milliseconds
- `suggestion`: Google Translate output
- `secs_per_word`: time to edit per source word
- `parsed_time_to_edit`: time to edit as a duration
- `last_translation`: post-edited output
- `HTER`: sentence-level HTER score

## Annotation

The post-edits were carried out following dedicated guidelines, which are available at [https://github.com/bsavoldi/post-edit_guidelines](https://github.com/bsavoldi/post-edit_guidelines).

## License

The MT-GenEval-based portion of the dataset is released under the same [CC-BY-SA-3.0 license](https://github.com/amazon-science/machine-translation-gender-eval/blob/main/LICENSE) as the original corpus.

## Citation

The dataset is associated with a paper accepted at EMNLP 2024. Please cite the paper when referencing the gender-bias-pe corpus:

```
@inproceedings{savoldi2024whattheharm,
  title = {{What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study}},
  author = {Savoldi, Beatrice and Papi, Sara and Negri, Matteo and Guerberof, Ana and Bentivogli, Luisa},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  month = {nov},
  year = {2024},
  address = {Miami},
  publisher = {Association for Computational Linguistics}
}
```

## Contribution

Thanks to [@Bsavoldi](https://huggingface.co/BSavoldi) for adding this dataset.
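As a reading aid for the effort-related fields (`time_to_edit`, `raw_word_count`, `secs_per_word`, `suggestion`, `last_translation`, `HTER`), the sketch below shows how such values could be derived. It rests on two assumptions that the card does not confirm: that `secs_per_word` is `time_to_edit` converted to seconds and divided by `raw_word_count`, and that an HTER-like score can be approximated by a word-level edit distance between the MT output and its post-edit, normalised by the post-edit length. Real HTER also counts block shifts, which this simplified version omits, so its values may differ from the `HTER` column. All function names are hypothetical.

```python
def secs_per_word(time_to_edit_ms, raw_word_count):
    """Seconds of editing per source word (assumed definition)."""
    if raw_word_count <= 0:
        raise ValueError("raw_word_count must be positive")
    return time_to_edit_ms / 1000.0 / raw_word_count


def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete a token from the MT output
                cur[j - 1] + 1,          # insert a token from the post-edit
                prev[j - 1] + (h != r),  # substitute, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def approx_hter(suggestion, last_translation):
    """Shift-free approximation of HTER on whitespace tokens.

    Edits needed to turn the MT output (`suggestion`) into the post-edit
    (`last_translation`), divided by the number of post-edit tokens.
    """
    hyp = suggestion.split()
    ref = last_translation.split()
    if not ref:
        raise ValueError("post-edited sentence is empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

With these definitions, an edit that takes 12000 ms on a 10-word source gives 1.2 seconds per word, and changing three of four tokens (for instance a masculine phrase post-edited into its feminine form) gives an approximate score of 0.75.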


Provider: maas
Created: 2025-09-26
