
FBK-MT/GNR-it

Source: Hugging Face. Updated: 2025-09-18. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/FBK-MT/GNR-it
Description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: clean
    path: data/clean-*
  - split: full
    path: data/full-*
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: gendered
    dtype: string
  - name: neutral
    dtype: string
  splits:
  - name: clean
    num_bytes: 25496599
    num_examples: 81389
  - name: full
    num_bytes: 50772270
    num_examples: 162778
  download_size: 28044975
  dataset_size: 76268869
task_categories:
- text-classification
- text-generation
language:
- it
tags:
- fairness
- rewriting
- gender-inclusive
- gender-neutral
size_categories:
- 100K<n<1M
---

# GNR-it Dataset

## Table of Contents

1. [Overview](#overview)
2. [Usage](#usage)
3. [License](#license)
4. [Citation](#citation)

## Overview

The **GNR-it** dataset contains pairs of gendered and gender-neutral Italian sentences.
We release this dataset to ensure reproducibility of the experiments in the paper [Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs](https://arxiv.org/abs/2509.13480), accepted at CLiC-it 2025.

The dataset is derived from the data originally created to train the gender-neutrality classifier [GeNTE-evaluator](https://huggingface.co/FBK-MT/GeNTE-evaluator).
The creation and curation of the original dataset are described in the paper [Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus (Piergentili et al., 2023)](https://aclanthology.org/2023.emnlp-main.873/).

Entries in this dataset include the following columns:

* **id**: a sequential identifier
* **gendered**: the gendered sentence
* **neutral**: the gender-neutral sentence

To facilitate reproducibility of our paper's experiments, we release both splits:

- **full**: the complete set of 162,778 pairs
- **clean**: a subset of 81,389 pairs selected based on their BERTScore

These two splits were used to fine-tune the following models:

- [Qwen3-8B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-clean)
- [Qwen3-8B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-full)
- [Qwen3-14B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-clean)
- [Qwen3-14B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-full)

## Usage

```python
from datasets import load_dataset

# Full set
full_data = load_dataset("FBK-MT/GNR-it", split="full")

# Clean set
clean_data = load_dataset("FBK-MT/GNR-it", split="clean")
```

## License

We release this dataset under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

## Citation

If you use this dataset in your work, please cite:

```
@misc{piergentili2025genderneutralrewritingitalianmodels,
  title={Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs},
  author={Andrea Piergentili and Beatrice Savoldi and Matteo Negri and Luisa Bentivogli},
  year={2025},
  eprint={2509.13480},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.13480},
}
```

## Contributions

Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.
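
As a follow-up to the Usage snippet above, the sketch below shows one possible way to turn rows with the documented `id`, `gendered`, and `neutral` columns into (source, target) pairs for a rewriting task. The helper name `to_pairs` and the sample sentences are hypothetical and invented for illustration; they are not taken from the dataset or from the paper's training pipeline. The sample uses plain dictionaries so that it runs offline.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_pairs(rows: Iterable[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield (gendered, neutral) pairs, skipping rows with an empty side.

    `to_pairs` is a hypothetical helper, not part of the dataset or of
    the `datasets` library.
    """
    for row in rows:
        gendered = row["gendered"].strip()
        neutral = row["neutral"].strip()
        if gendered and neutral:
            yield gendered, neutral


# Invented sample rows that mimic the documented schema (not real dataset entries).
sample_rows = [
    {
        "id": 0,
        "gendered": "Gli studenti sono arrivati.",
        "neutral": "Le persone che studiano sono arrivate.",
    },
    {"id": 1, "gendered": "   ", "neutral": "Frase senza controparte."},
]

pairs = list(to_pairs(sample_rows))
for source, target in pairs:
    print(f"{source} -> {target}")
```

Because iterating over a split returned by `load_dataset` yields one dictionary per row, the same helper should also accept `clean_data` or `full_data` directly, for example `to_pairs(clean_data)`.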
Provider:
FBK-MT