
GNR-it

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/FBK-MT/GNR-it
Description:
# GNR-it Dataset

## Table of Contents

1. [Overview](#overview)
2. [Usage](#usage)
3. [License](#license)
4. [Citation](#citation)

## Overview

The **GNR-it** dataset contains pairs of gendered and gender-neutral Italian sentences.

We release this dataset to ensure reproducibility of the experiments in the paper [Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs](https://arxiv.org/abs/2509.13480), accepted at CLiC-it 2025.

The dataset is derived from the data originally created to train the gender-neutrality classifier [GeNTE-evaluator](https://huggingface.co/FBK-MT/GeNTE-evaluator). The creation and curation of the original dataset are described in the paper [Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus (Piergentili et al., 2023)](https://aclanthology.org/2023.emnlp-main.873/).

Entries in this dataset include the following columns:

* **id**: a progressive identifier
* **gendered**: the gendered sentence
* **neutral**: the gender-neutral sentence

To facilitate reproducibility of our paper's experiments, we release two splits:

- **full**: the complete set of 162,778 pairs
- **clean**: a subset of 81,389 pairs selected based on their BERTScore

These two splits were used to fine-tune the following models:

- [Qwen3-8B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-clean)
- [Qwen3-8B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-full)
- [Qwen3-14B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-clean)
- [Qwen3-14B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-full)

## Usage

```python
from datasets import load_dataset

# Full set
full_data = load_dataset("FBK-MT/GNR-it", split="full")

# Clean set
clean_data = load_dataset("FBK-MT/GNR-it", split="clean")
```

## License

We release this dataset under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

## Citation

If you use this dataset in your work, please cite:

```
@misc{piergentili2025genderneutralrewritingitalianmodels,
  title={Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs},
  author={Andrea Piergentili and Beatrice Savoldi and Matteo Negri and Luisa Bentivogli},
  year={2025},
  eprint={2509.13480},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.13480},
}
```

## Contributions

Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.
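
## Example: building rewriting pairs from the columns

Since each entry carries a `gendered` and a `neutral` column, one plausible way to prepare the data for fine-tuning is to map every row to a prompt/target pair. The snippet below is a minimal sketch under stated assumptions, not the procedure used in the paper: the `PROMPT` text, the helper name `to_training_examples`, and the toy rows are all hypothetical and are not taken from the dataset or its documentation.

```python
from typing import Any, Dict, Iterable, List, Mapping

# Hypothetical instruction text; the dataset card does not specify
# the prompt used for the fine-tuned models.
PROMPT = "Riscrivi la frase seguente in modo neutro rispetto al genere:"


def to_training_examples(rows: Iterable[Mapping[str, Any]]) -> List[Dict[str, Any]]:
    """Turn rows with `id`, `gendered`, `neutral` into prompt/target pairs."""
    examples = []
    for row in rows:
        gendered = row["gendered"].strip()
        neutral = row["neutral"].strip()
        # Skip rows where either side is empty after stripping whitespace.
        if not gendered or not neutral:
            continue
        examples.append(
            {
                "id": row["id"],
                "prompt": f"{PROMPT}\n{gendered}",
                "target": neutral,
            }
        )
    return examples


# Toy rows invented for illustration; they are not dataset entries.
toy_rows = [
    {"id": 0, "gendered": "I cittadini hanno votato.", "neutral": "La cittadinanza ha votato."},
    {"id": 1, "gendered": "   ", "neutral": "Frase senza sorgente."},
]
examples = to_training_examples(toy_rows)
```

Because a split loaded with `load_dataset` yields one dictionary per row when iterated, `full_data` or `clean_data` from the Usage section could be passed to the helper in place of `toy_rows`.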

Provider: maas
Created: 2025-09-26