GNR-it

Name: GNR-it
Creator: maas
Published: 2025-12-05 16:51:12
License: not specified in the listing metadata (the dataset card states CC BY 4.0)

Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/FBK-MT/GNR-it

Resource description:

# GNR-it Dataset

## Table of Contents

1. [Overview](#overview)
2. [Usage](#usage)
3. [License](#license)
4. [Citation](#citation)

## Overview

The **GNR-it** dataset contains pairs of gendered and gender-neutral Italian sentences. We release this dataset to ensure reproducibility of the experiments in the paper [Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs](https://arxiv.org/abs/2509.13480), accepted at CLiC-it 2025.

The dataset is derived from the data originally created to train the gender-neutrality classifier [GeNTE-evaluator](https://huggingface.co/FBK-MT/GeNTE-evaluator). The creation and curation of the original dataset are described in the paper [Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus (Piergentili et al., 2023)](https://aclanthology.org/2023.emnlp-main.873/).

Entries in this dataset include the following columns:

* **id**: a progressive identifier
* **gendered**: the gendered sentence
* **neutral**: the gender-neutral sentence

To facilitate reproducibility of our paper's experiments, we release two splits:

- **full**: the complete set of 162,778 pairs
- **clean**: a subset of 81,389 pairs selected based on their BERTScore

These two splits were used to fine-tune the following models:

- [Qwen3-8B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-clean)
- [Qwen3-8B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-full)
- [Qwen3-14B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-clean)
- [Qwen3-14B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-full)

## Usage

```python
from datasets import load_dataset

# Full set
full_data = load_dataset("FBK-MT/GNR-it", split="full")

# Clean set
clean_data = load_dataset("FBK-MT/GNR-it", split="clean")
```

## License

We release this dataset under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

## Citation

If you use this dataset in your work, please cite:

```
@misc{piergentili2025genderneutralrewritingitalianmodels,
      title={Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs},
      author={Andrea Piergentili and Beatrice Savoldi and Matteo Negri and Luisa Bentivogli},
      year={2025},
      eprint={2509.13480},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.13480},
}
```

## Contributions

Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.
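
## Example: preparing fine-tuning pairs

The card lists three columns (`id`, `gendered`, `neutral`) and says the splits were used for fine-tuning, but it does not give the prompt format. The following is a minimal sketch of how a row could be turned into a prompt/completion pair. It is not the authors' pipeline: the instruction text, the helper names `to_training_pair` and `changed_pairs`, and the toy rows are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Column names as listed in the dataset card.
EXPECTED_COLUMNS = ("id", "gendered", "neutral")

# Hypothetical instruction text; the prompt used in the paper is not stated in this card.
INSTRUCTION = "Riscrivi la frase in modo neutro rispetto al genere."


def to_training_pair(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map one row to a prompt/completion pair, checking the expected columns first."""
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return {
        "id": row["id"],
        "prompt": f"{INSTRUCTION}\n\n{row['gendered']}",
        "completion": row["neutral"],
    }


def changed_pairs(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only rows whose neutral sentence differs from the gendered one."""
    return [r for r in rows if r["gendered"].strip() != r["neutral"].strip()]


# Invented rows for illustration only; they are NOT taken from GNR-it.
toy_rows = [
    {"id": 0, "gendered": "I professori sono arrivati.", "neutral": "Il corpo docente è arrivato."},
    {"id": 1, "gendered": "Buongiorno.", "neutral": "Buongiorno."},
]

pairs = [to_training_pair(r) for r in changed_pairs(toy_rows)]
print(len(pairs))
print(pairs[0]["prompt"])
```

With the real data, rows returned by `load_dataset("FBK-MT/GNR-it", split="clean")` are dictionaries keyed by column name, so the same helper should be usable through `clean_data.map(to_training_pair)`. Whether dropping unchanged pairs is desirable depends on the training setup and is left as a choice here.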


Provider: maas

Created: 2025-09-26
