
Neo-GATE

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/FBK-MT/Neo-GATE
Resource description:
# Dataset card for Neo-GATE

**Homepage:** [https://mt.fbk.eu/neo-gate/](https://mt.fbk.eu/neo-gate/)

## Dataset summary

Neo-GATE is a bilingual corpus designed to benchmark the ability of machine translation (MT) systems to translate from English into Italian using gender-inclusive neomorphemes. It is built upon GATE [(Rarrick et al., 2023)](https://dl.acm.org/doi/10.1145/3600211.3604675), a benchmark for the evaluation of gender rewriters and gender bias in MT.

Neo-GATE includes 841 `test` entries (`Neo-GATE.tsv`) and 100 `dev` entries (`Neo-GATE-dev.tsv`).

Each entry is composed of an English source sentence, three Italian references that differ only in whether the gender-marked words are masculine, feminine, or nonbinary, and the annotation of the target words that are relevant for the evaluation of gender-inclusive MT.

The source sentences are gender-ambiguous, i.e. they provide no information about the gender of human referents. In this setting, words referring to human entities in the target language should be rendered with neomorphemes, i.e. special characters or symbols that replace masculine and feminine inflectional morphemes.

Neo-GATE allows for the evaluation of any neomorpheme paradigm in Italian. For more details see the [Adaptation](#adaptation) section below.

## Data Fields

`Neo-GATE.tsv` includes the following columns:

- **#:** Neo-GATE unique identifier.
- **GATE-ID:** A unique identifier of the entry in GATE, composed of a prefix indicating the subset of origin within GATE (e.g., `IT_2_variants`) followed by a serial number indicating the position of the entry within that subset (i.e., `001`, `002`, etc.).
- **SOURCE:** The English source sentence.
- **REF-M:** The Italian reference where all gender-marked terms are masculine.
- **REF-F:** The Italian reference where all gender-marked terms are feminine.
- **REF-TAGGED:** The Italian reference where all gender-marked terms are tagged with Neo-GATE's annotation.
- **ANNOTATION:** The word-level annotation.
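
The columns listed above can be read with Python's standard `csv` module. The following is a minimal sketch, not part of the official tooling: it assumes the TSV file starts with a header row containing exactly these column names, and the sample row (including the exact shape of the `GATE-ID` value) is made up for illustration and not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset card.
COLUMNS = ["#", "GATE-ID", "SOURCE", "REF-M", "REF-F", "REF-TAGGED", "ANNOTATION"]


def read_neogate(handle):
    """Parse a Neo-GATE TSV stream into a list of dicts keyed by column name.

    Assumption: the first row is a header containing the column names above.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up row for illustration only; it is NOT taken from the dataset,
# and the tagged reference and annotation are left as placeholders.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join(
    ["1", "IT_2_variants_001", "I am a student.", "Sono uno studente.",
     "Sono una studentessa.", "(tagged reference)", "(annotation)"]
) + "\n"

rows = read_neogate(io.StringIO(SAMPLE))
print(rows[0]["GATE-ID"], "|", rows[0]["SOURCE"])
```

To read the real file, pass an open file handle instead of the in-memory sample, for example `open("Neo-GATE.tsv", encoding="utf-8", newline="")`.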
## Configurations

There are two configurations available:

- `main`: with placeholders to be replaced with the desired neomorpheme paradigm.
- `schwa_simple`: already adapted to the paradigm included in `schwa-simple.json`.

## Dataset creation

Please refer to [the original paper](https://aclanthology.org/2024.eamt-1.25/) for full details on dataset creation.

## Curation rationale

Neo-GATE was designed to allow for the evaluation of gender-inclusive MT and to be adaptable to any neomorpheme paradigm in Italian. To this aim, the original Italian references found in GATE were edited so as to have placeholder tags in place of gendered morphemes and function words (articles, possessive adjectives, etc.) referred to human entities. The tags were designed to cover all parts of the grammar which express grammatical gender, and to be replaced with corresponding forms in the desired neomorpheme paradigm.

## Adaptation

To adapt Neo-GATE to the desired neomorpheme paradigm, a `.json` file mapping Neo-GATE's tagset to the desired forms is required. See `schwa-complex.json` or `asterisk.json` for an example. For more information on the tagset, see Table 8 in [the original paper](https://aclanthology.org/2024.eamt-1.25/).

To create the adapted references and annotations, use the `neogate_adapt.py` script with the following syntax:

    python neogate_adapt.py --tagset JSON_FILE_PATH --out OUTPUT_FILE_NAME

This command will create two files: `OUTPUT_FILE_NAME.ref`, containing the adapted references, and `OUTPUT_FILE_NAME.ann`, containing the adapted annotations.

For instance, to generate the references and the annotations adapted to the Schwa paradigm provided in the example file `schwa-complex.json`, the following command can be used:

    python neogate_adapt.py --tagset schwa-complex.json --out neogate_schwa

This will create the two files `neogate_schwa.ref` and `neogate_schwa.ann`.

By default, the script will adapt references and annotations found in `Neo-GATE.tsv`.
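
When the adaptation step is driven from a Python pipeline, the command above can be assembled programmatically. The sketch below is only an illustration: the helper names `adapt_command`, `expected_outputs`, and `run_adapt` are hypothetical, and the only facts taken from this card are the `--tagset` and `--out` arguments and the `.ref` / `.ann` output suffixes.

```python
import subprocess
import sys


def adapt_command(tagset, out, script="neogate_adapt.py"):
    """Build the argument list for the adaptation script (hypothetical helper)."""
    return [sys.executable, script, "--tagset", tagset, "--out", out]


def expected_outputs(out):
    """Return the two file names the card says the script creates."""
    return (out + ".ref", out + ".ann")


def run_adapt(tagset, out):
    """Run the script; requires neogate_adapt.py and its inputs in the working directory."""
    subprocess.run(adapt_command(tagset, out), check=True)
    return expected_outputs(out)


cmd = adapt_command("schwa-complex.json", "neogate_schwa")
print(" ".join(cmd[1:]))
print(expected_outputs("neogate_schwa"))
```

Calling `run_adapt("schwa-complex.json", "neogate_schwa")` would actually execute the script, so it is left out of the example to keep the sketch runnable without the dataset files.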
If the `Neo-GATE.tsv` file is located in a different directory, or if you wish to use a different file (e.g., the dev set split file `Neo-GATE-dev.tsv`), you can specify the path to the file with the optional argument `--neogate`.

## Evaluation

The evaluation code is available at [fbk-NEUTR-evAL](https://github.com/hlt-mt/fbk-NEUTR-evAL/blob/main/solutions/Neo-GATE.md).

## Dataset Curators

- Andrea Piergentili (FBK): apiergentili@fbk.eu
- Beatrice Savoldi (FBK): bsavoldi@fbk.eu
- Luisa Bentivogli (FBK): bentivo@fbk.eu

## Licensing Information

The Neo-GATE corpus is released under a Creative Commons Attribution 4.0 International license (CC BY 4.0). See the [LICENSE](LICENSE) file for details.

## Citation

If you use Neo-GATE in your work, please cite the following paper:

    @inproceedings{piergentili-etal-2024-enhancing,
        title = "Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models",
        author = "Piergentili, Andrea and Savoldi, Beatrice and Negri, Matteo and Bentivogli, Luisa",
        booktitle = "Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)",
        month = jun,
        year = "2024",
        address = "Sheffield, UK",
        publisher = "European Association for Machine Translation (EAMT)",
        url = "https://aclanthology.org/2024.eamt-1.25",
        pages = "300--314"
    }

## Contributions

Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.

Provider: maas
Created: 2025-09-26