GNR-it
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/FBK-MT/GNR-it
# GNR-it Dataset
## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [License](#license)
4. [Citation](#citation)
## Overview
The **GNR-it** dataset contains pairs of gendered and gender-neutral Italian sentences.
We release this dataset to ensure reproducibility of the experiments in the paper [Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs](https://arxiv.org/abs/2509.13480), accepted at CLiC-it 2025.
The dataset is derived from the data originally created to train the gender-neutrality classifier [GeNTE-evaluator](https://huggingface.co/FBK-MT/GeNTE-evaluator).
The creation and curation of the original dataset is described in the paper [Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus (Piergentili et al., 2023)](https://aclanthology.org/2023.emnlp-main.873/).
Entries in this dataset include the following columns:
* **id**: a progressive identifier
* **gendered**: the gendered sentence
* **neutral**: the gender-neutral sentence
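As an illustration of the column layout above, here is a minimal sketch that wraps a raw row in a typed record and checks that the three documented columns are present. The names `SentencePair`, `row_to_pair`, and `EXPECTED_COLUMNS` are hypothetical helpers introduced for this example, the type of `id` is left open because the card does not state it, and the Italian sentence pair is invented for illustration rather than taken from the dataset.

```python
from typing import Any, Mapping, NamedTuple

# Column names as listed in the dataset card.
EXPECTED_COLUMNS = ("id", "gendered", "neutral")


class SentencePair(NamedTuple):
    """One dataset entry: an identifier plus the two sentence variants."""

    id: Any  # the card does not state the type of `id`, so it is left open
    gendered: str
    neutral: str


def row_to_pair(row: Mapping[str, Any]) -> SentencePair:
    """Convert a dict-like row into a SentencePair.

    Raises KeyError if any of the documented columns is missing.
    """
    missing = [name for name in EXPECTED_COLUMNS if name not in row]
    if missing:
        raise KeyError(f"row is missing column(s): {missing}")
    return SentencePair(row["id"], row["gendered"], row["neutral"])


# Invented illustration only; this pair is NOT taken from the dataset.
example_row = {
    "id": 0,
    "gendered": "I cittadini hanno votato.",
    "neutral": "La cittadinanza ha votato.",
}
example_pair = row_to_pair(example_row)
```

Rows yielded by iterating over a split loaded with `load_dataset` are plain dictionaries, so they should be accepted by `row_to_pair` unchanged.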
To facilitate reproducibility of our paper’s experiments, we release two splits:
- **full**: the complete set of 162,778 pairs
- **clean**: a subset of 81,389 pairs selected based on their BERTScore
These two splits were used to fine-tune the following models:
- [Qwen3-8B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-clean)
- [Qwen3-8B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-8B-GNR-it-full)
- [Qwen3-14B-GNR-it-clean](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-clean)
- [Qwen3-14B-GNR-it-full](https://huggingface.co/FBK-MT/Qwen3-14B-GNR-it-full)
## Usage
```python
from datasets import load_dataset
# Full set
full_data = load_dataset("FBK-MT/GNR-it", split="full")
# Clean set
clean_data = load_dataset("FBK-MT/GNR-it", split="clean")
```
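After loading, it can be worth confirming that the splits match the sizes reported in the Overview (162,778 and 81,389 pairs) and that `clean` really is a subset of `full`. The sketch below is one way to do that. `check_splits` and `EXPECTED_SIZES` are hypothetical names introduced here, and the subset check assumes that `id` values are shared between the two splits, which the card does not confirm.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional

# Split sizes as reported in the dataset card.
EXPECTED_SIZES = {"full": 162_778, "clean": 81_389}


def check_splits(
    full_rows: Iterable[Mapping[str, Any]],
    clean_rows: Iterable[Mapping[str, Any]],
    expected_sizes: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    sizes = EXPECTED_SIZES if expected_sizes is None else expected_sizes
    problems: List[str] = []

    full_ids = [row["id"] for row in full_rows]
    clean_ids = [row["id"] for row in clean_rows]

    for name, ids in (("full", full_ids), ("clean", clean_ids)):
        # Row count should match the number reported for this split.
        if len(ids) != sizes[name]:
            problems.append(f"{name}: expected {sizes[name]} rows, found {len(ids)}")
        # Identifiers should be unique within a split.
        if len(set(ids)) != len(ids):
            problems.append(f"{name}: duplicate ids")

    # ASSUMPTION: ids are shared across splits, so clean ids must appear in full.
    extra = set(clean_ids) - set(full_ids)
    if extra:
        problems.append(f"clean: {len(extra)} id(s) not present in full")

    return problems
```

With the variables from the snippet above, `check_splits(full_data, clean_data)` would return an empty list if all checks pass. If the identifiers turn out not to be aligned across splits, the last check is not meaningful and can be dropped.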
## License
We release this dataset under the Creative Commons Attribution 4.0 International license (CC BY 4.0).
## Citation
If you use this dataset in your work, please cite:
```
@misc{piergentili2025genderneutralrewritingitalianmodels,
title={Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs},
author={Andrea Piergentili and Beatrice Savoldi and Matteo Negri and Luisa Bentivogli},
year={2025},
eprint={2509.13480},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.13480},
}
```
## Contributions
Thanks to [@apiergentili](https://huggingface.co/apiergentili) for adding this dataset.
Provider: maas
Created: 2025-09-26



