FBK-MT/GNR-it

Name: FBK-MT/GNR-it
Creator: FBK-MT
Published: 2025-09-18 06:30:19
License: CC BY 4.0

Hugging Face · Updated 2025-09-18 · Indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/FBK-MT/GNR-it

Resource description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: clean
    path: data/clean-*
  - split: full
    path: data/full-*
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: gendered
    dtype: string
  - name: neutral
    dtype: string
  splits:
  - name: clean
    num_bytes: 25496599
    num_examples: 81389
  - name: full
    num_bytes: 50772270
    num_examples: 162778
  download_size: 28044975
  dataset_size: 76268869
task_categories:
- text-classification
- text-generation
language:
- it
tags:
- fairness
- rewriting
- gender-inclusive
- gender-neutral
size_categories:
- 100K<n<1M
---

# GNR-it Dataset

## Table of Contents

1. [Overview](#overview)
2. [Usage](#usage)
3. [License](#license)
4. [Citation](#citation)

## Overview

The **GNR-it** dataset contains pairs of gendered and gender-neutral Italian sentences.
We release this dataset to ensure reproducibility of the experiments in the paper [Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs](https://arxiv.org/abs/2509.13480), accepted at CLiC-it 2025.

The dataset is derived from the data originally created to train the gender-neutrality classifier [GeNTE-evaluator](https://huggingface.co/FBK-MT/GeNTE-evaluator).
The creation and curation of the original dataset are described in the paper [Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus (Piergentili et al., 2023)](https://aclanthology.org/2023.emnlp-main.873/).

Each entry in this dataset has the following columns:

* **id**: a sequential integer identifier
* **gendered**: the gendered sentence
* **neutral**: the gender-neutral sentence

To facilitate reproducibility of our paper’s experiments, we release both splits:

- **full**: the complete set of 162,778 pairs
- **clean**: a subset of 81,389 pairs selected based on their BERTScore

These two splits were used to fine-tune the following models:

- [Qwen3-8B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-clean)
- [Qwen3-8B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-full)
- [Qwen3-14B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-clean)
- [Qwen3-14B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-full)

## Usage

```python
from datasets import load_dataset

# Full set
full_data = load_dataset("FBK-MT/GNR-it", split="full")

# Clean set
clean_data = load_dataset("FBK-MT/GNR-it", split="clean")
```

## License

We release this dataset under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

## Citation

If you use this dataset in your work, please cite:

```
@misc{piergentili2025genderneutralrewritingitalianmodels,
      title={Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs},
      author={Andrea Piergentili and Beatrice Savoldi and Matteo Negri and Luisa Bentivogli},
      year={2025},
      eprint={2509.13480},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.13480},
}
```

## Contributions

Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.
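As a follow-up to the Usage section, the sketch below shows one possible way to turn rows with the `gendered` and `neutral` columns into prompt/target pairs for fine-tuning. It is only an illustration: the `INSTRUCTION` text, the `to_training_example` helper, and the sample row are hypothetical and are not taken from the dataset or from the paper, whose actual prompt format may differ.

```python
from typing import Dict

# Hypothetical instruction text (an assumption, not the prompt used in the paper).
INSTRUCTION = "Riscrivi la frase seguente in forma neutra rispetto al genere."


def to_training_example(row: Dict[str, object]) -> Dict[str, str]:
    """Map one row with `gendered` and `neutral` columns to a prompt/target pair."""
    return {
        "prompt": f"{INSTRUCTION}\n\n{row['gendered']}",
        "target": str(row["neutral"]),
    }


# Invented sample row that only mirrors the column layout (id, gendered, neutral);
# it is NOT an actual entry of GNR-it.
sample_rows = [
    {
        "id": 0,
        "gendered": "I professori sono arrivati.",
        "neutral": "Il corpo docente è arrivato.",
    },
]

examples = [to_training_example(row) for row in sample_rows]
print(examples[0]["prompt"])
print(examples[0]["target"])
```

With the real data, the same helper could be applied through the `map` method of a loaded split, for example `load_dataset("FBK-MT/GNR-it", split="clean").map(to_training_example)`, which adds the `prompt` and `target` columns alongside the original ones.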

Provider:

FBK-MT
