
mGeNTE

ModelScope community; updated 2025-12-05; indexed 2025-10-04
Download link:
https://modelscope.cn/datasets/FBK-MT/mGeNTE
Resource description:
# Dataset Card for mGeNTE

**Homepage:** https://mt.fbk.eu/mgente/

**Code:** https://github.com/g8a9/mgente-gap

The mGeNTE dataset is introduced in the paper "[Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTE](https://arxiv.org/abs/2501.09409v3)", presented at [EMNLP 2025](https://2025.emnlp.org/).

## Dataset Summary

mGeNTE (**M**ultilingual **Ge**nder-**N**eutral **T**ranslation **E**valuation) is a natural, multilingual corpus designed to benchmark gender-neutral language and automatic translation.

mGeNTE is built upon European Parliament speech data extracted from the [Europarl corpus](https://www.statmt.org/europarl/archives.html), and represents a multilingual expansion of the bilingual [GeNTE](https://huggingface.co/datasets/FBK-MT/GeNTE) dataset (v1.0, now superseded by mGeNTE).

For each language pair, mGeNTE comprises 1,500 parallel sentences (6,000 entries in total), which are enriched with manual annotations and feature a balanced distribution of translation phenomena that entail either i) a gender-neutral translation (`set-N`), or ii) a gendered translation in the target language (`set-G`).

### Supported Tasks and Languages

mGeNTE supports cross-lingual (en-it, en-es, en-de, en-el) gender-inclusive translation and intra-lingual (it-it, es-es, de-de, el-el) gender-inclusive rewriting tasks.

## Dataset Structure

### Data Instances

The dataset consists of two configuration types:

- **`mGeNTE`:** The complete mGeNTE corpus and its annotations, consisting of a tsv file for each language pair
- **`mGeNTE_common`:** Subset of the mGeNTE corpus that comprises three alternative gender-neutral reference translations

### Data Fields

Each tsv file in **`mGeNTE`** is organized into 10 tab-separated columns as follows:

- ID: The unique mGeNTE ID.
- Europarl_ID: The original sentence ID from Europarl's common-test-set 2.
- SET: Indicates whether the entry belongs to the Set-G or the Set-N subportion of the corpus.
- SRC: The English source sentence.
- REF-G: The gendered reference translation in the target language.
- REF-N: The gender-neutral reference in the target language, produced by a professional translator.
- COMMON: Indicates whether the entry is part of the mGeNTE common set (yes/no).
- GENDER: For entries belonging to Set-G, indicates whether the entry is Feminine or Masculine (F/M).
- REF-G_ann: Tokenized version of the gendered reference translation with target gendered words annotated.
- G-WORDS: List of annotated target gendered words separated by "&".

For entries in the common set, REF-N provides gender-neutral reference translation no. 2.

Each tsv file in **`mGeNTE_common`** comprises 200 entries organized into 11 tab-separated columns as follows:

- ID: The unique mGeNTE ID.
- Europarl_ID: The original sentence ID from Europarl's common-test-set 2.
- SET: Indicates whether the entry belongs to the Set-G or the Set-N subportion of the corpus.
- SRC: The English source sentence.
- REF-G: The gendered reference translation in the target language.
- REF-N1: The gender-neutral reference in the target language produced by Translator 1.
- REF-N2: The gender-neutral reference in the target language produced by Translator 2.
- REF-N3: The gender-neutral reference in the target language produced by Translator 3.
- GENDER: For entries belonging to Set-G, indicates whether the entry is Feminine or Masculine (F/M).
- REF-G_ann: Tokenized version of the gendered reference translation with target gendered words annotated.
- G-WORDS: List of annotated target gendered words separated by "&".

## Dataset Creation

Refer to the [paper](https://arxiv.org/abs/2501.09409) for full details on dataset creation.

### Curation Rationale

mGeNTE is designed to test gender-neutral language modeling and evaluate models’ ability to perform gender-neutral translations under desirable circumstances. When referents’ gender is unknown or irrelevant, undue gender inferences should not be made, and the translation should be neutral. Conversely, when a referent’s gender is relevant and known, MT should not over-generalize to neutral translations.

The corpus hence consists of parallel sentences with mentions of human referents that equally represent two translation scenarios:

- `Set-N`: featuring gender-ambiguous source sentences that must be rendered neutrally in translation;
- `Set-G`: featuring gender-unambiguous source sentences, which should be rendered with the appropriate gendered (masculine or feminine) forms in translation.

Across the four available language pairs, mGeNTE features 987 fully parallel en-it/es/de/el segments to maximize comparability, i.e. the `Parallel set`. Parallel multilingual instances feature the same string in the `SRC` data field.

### Source Data

The dataset contains text data extracted and edited from the Europarl Corpus ([common test set 2](https://www.statmt.org/europarl/archives.html)), and all rights of the data belong to the European Union and/or respective copyright holders. Please refer to Europarl “[Terms of Use](https://www.statmt.org/europarl/archives.html)” for details.

### Annotations

For each sentence pair extracted from Europarl (src, ref), mGeNTE includes an additional reference in the target language, which differs from the original one only in that it refers to the human entities with neutral expressions.

The neutral reference translations were created by professionals based on the following language-specific guidelines:

- [en-it](https://fbk.sharepoint.com/:b:/s/MTUnit/ET75jsZb-ZdJgfFcEsJiKo4Bugbw7E7gUutUyGlSz3U3mw?e=uNBOWc)
- [en-es](https://fbk.sharepoint.com/:b:/s/MTUnit/EXiEnDUA4QpJqALJF9R0U4oBy6j45uL8y2a04fSyRoeOlQ?e=EMiZsP)
- [en-de](https://fbk.sharepoint.com/:b:/s/MTUnit/ERpaw9A_2ENBnG0Y257xgdsB5l4Ntp2DdEcJGLHtVoEjcA?e=KZcjjw)
- [en-el](https://fbk.sharepoint.com/:b:/s/MTUnit/Eb2ibn4hKtRClImVT-pZKzgBOrnXjtz2SRrsKDz5W6K6Kw?e=g2ZuIL)

### Dataset Curators

The authors of mGeNTE are the dataset curators: en-it (A. Piergentili and B. Savoldi), en-es (Eleonora Cupin and B. Savoldi), en-de (M. Thind and A. Lauscher), en-el (E. Gkovedarou). For coordination of curation efforts, contact Beatrice Savoldi (FBK) at <bsavoldi@fbk.eu>.

### Licensing Information

The mGeNTE corpus is released under a Creative Commons Attribution 4.0 International license (CC BY 4.0).

## Citation

```bibtex
@inproceedings{savoldi2025mind,
  title={Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTE},
  author={
    Savoldi, Beatrice and
    Attanasio, Giuseppe and
    Cupin, Eleonora and
    Gkovedarou, Eleni and
    Hackenbuchner, Jani{\c{c}}a and
    Lauscher, Anne and
    Negri, Matteo and
    Piergentili, Andrea and
    Thind, Manjinder and
    Bentivogli, Luisa
  },
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://arxiv.org/abs/2501.09409}
}
```

## Contributions

Thanks to [@BSavoldi](https://huggingface.co/BSavoldi) for adding this dataset.
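## Example: reading the `mGeNTE` columns

The column layout listed under Data Fields could be read along the following lines. This is only a minimal sketch, not part of the official release: it assumes each tsv file starts with a header row that uses the column names exactly as listed above, and that the `SET` column spells its values like `Set-G` / `Set-N` (the comparison is case-insensitive to hedge against that). The function names and the two sample rows are hypothetical placeholders, not real dataset entries.

```python
import csv
import io

# Column names as listed under "Data Fields" for the `mGeNTE` configuration.
MGENTE_COLUMNS = [
    "ID", "Europarl_ID", "SET", "SRC", "REF-G",
    "REF-N", "COMMON", "GENDER", "REF-G_ann", "G-WORDS",
]


def read_mgente_tsv(handle):
    """Parse a tab-separated stream (assumed to have a header row) into dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in MGENTE_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # G-WORDS is a "&"-separated list of annotated gendered words.
        raw = row["G-WORDS"] or ""
        row["G-WORDS"] = [w.strip() for w in raw.split("&") if w.strip()]
        rows.append(row)
    return rows


def split_by_set(rows):
    """Group rows by their SET value, lower-cased (assumed 'set-g' / 'set-n')."""
    groups = {"set-g": [], "set-n": []}
    for row in rows:
        groups.setdefault(row["SET"].strip().lower(), []).append(row)
    return groups


def parallel_sources(rows_by_pair):
    """Return the SRC strings that occur in every language pair."""
    src_sets = [{r["SRC"] for r in rows} for rows in rows_by_pair.values()]
    return set.intersection(*src_sets) if src_sets else set()


# Made-up placeholder rows that only mimic the documented layout.
SAMPLE = "\n".join([
    "\t".join(MGENTE_COLUMNS),
    "\t".join(["id-1", "ep-1", "Set-G", "source one", "ref-g one",
               "ref-n one", "no", "F", "ref-g one ann", "word1&word2"]),
    "\t".join(["id-2", "ep-2", "Set-N", "source two", "ref-g two",
               "ref-n two", "yes", "", "", ""]),
]) + "\n"

rows = read_mgente_tsv(io.StringIO(SAMPLE))
groups = split_by_set(rows)
print(len(groups["set-g"]), len(groups["set-n"]))
print(rows[0]["G-WORDS"])
```

For a real file, the same reader could be given a file opened with `open(path, encoding="utf-8", newline="")`. `parallel_sources` relies on the card's statement that parallel instances share the same `SRC` string, so passing it a mapping from language pair to parsed rows would yield the shared source sentences.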

Provider:
maas
Created:
2025-09-26