BSC-LT/geneval_catalan
Source: Hugging Face. Updated 2026-04-09; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/BSC-LT/geneval_catalan
Resource summary:
---
language:
- en
- ca
license: cc-by-sa-3.0
task_categories:
- translation
tags:
- gender-bias
- evaluation
- catalan
- machine-translation
- geneval
pretty_name: GenEval Catalan
size_categories:
- 1K<n<10K
---
# Dataset Card for GenEval Catalan
## Dataset Summary
GenEval Catalan is an English→Catalan extension of [MT-GenEval](https://github.com/amazon-science/machine-translation-gender-eval), a benchmark for evaluating gender accuracy in machine translation. It is derived from the original English–Spanish MT-GenEval dataset (Currey et al., 2022), with the Spanish side replaced by Catalan translations produced by professional human translators (with the exception of the single-sentence dev split, which uses automatic translation).
The dataset preserves the original evaluation conditions — single sentences and sentences with preceding context — and adds a new **trailing context** condition in which the gendered sentence precedes the context sentence rather than following it. Context splits include three Catalan fields per item: the gender-flipped target sentence (`ca_target`), the original-gender target sentence (`ca_original`), and a full two-sentence translation in original gender (`ca_full`). This supports counterfactual evaluation directly from the dataset without requiring separate gender-flipped test sets.
Due to filtering (sentence pairs where the Catalan translation did not preserve gender marking were discarded), all splits are smaller than the corresponding English–Spanish originals.
## Supported Tasks and Leaderboards
The dataset is intended for evaluating **gender accuracy in English to Catalan machine translation**. It supports:
- Counterfactual evaluation: each context item includes both the gender-flipped (`ca_target`) and original-gender (`ca_original`) translations, enabling direct comparison without constructing separate test sets
- Contextual evaluation: assessing whether surrounding context influences gender agreement
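The counterfactual check can be sketched as follows. This is a minimal illustration in the spirit of the MT-GenEval accuracy metric, not the official scorer: a hypothesis counts as wrong if it contains any word that appears only in the wrong-gender reference. It assumes that `ca_original` is the correct reference for the English source and that `ca_target` is the contrastive one. The function names and the regex tokenizer are hypothetical choices made for this sketch.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    # \w is Unicode-aware in Python 3, so accented letters such as "é" are kept.
    return set(re.findall(r"\w+", text.lower()))


def is_gender_correct(hypothesis: str, correct_ref: str, contrastive_ref: str) -> bool:
    """Return True if the hypothesis uses no word unique to the wrong-gender reference."""
    wrong_gender_only = tokenize(contrastive_ref) - tokenize(correct_ref)
    return not (tokenize(hypothesis) & wrong_gender_only)


def gender_accuracy(rows, hypotheses) -> float:
    """Fraction of hypotheses judged gender-correct against each row's references."""
    if len(rows) != len(hypotheses) or not rows:
        raise ValueError("rows and hypotheses must be non-empty and of equal length")
    correct = sum(
        is_gender_correct(hyp, row["ca_original"], row["ca_target"])
        for row, hyp in zip(rows, hypotheses)
    )
    return correct / len(rows)
```

A hypothesis that avoids gendered words altogether also passes this check, so the score is best read alongside a general translation quality metric.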
## Languages
| Language | Code |
|---|---|
| English | `en` |
| Catalan | `ca` |
## Dataset Structure
### Data Instances
Each instance is a parallel sentence or sentence pair in English and Catalan. In context splits, the English field contains two sentences separated by `<sep>`.
Single-sentence items occur as counterfactual pairs: each English sentence appears twice in the dataset, once with a masculine and once with a feminine Catalan translation. This mirrors the structure of the original MT-GenEval dataset and enables direct counterfactual comparison.
Example (`single_sentence`):
```json
{
"en": "Mommy blogging is dead, and I think most of my colleagues would agree, she told Vox in 2019.",
"ca": "\"Els blocs de mares estan morts, i crec que la majoria dels meus col·legues hi estarien d'acord\", va dir a Vox el 2019.",
"gender": "feminine",
"split": "dev"
}
```
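Because each English sentence is described as appearing once per gender, single-sentence items can be regrouped into counterfactual pairs by their `en` value. The sketch below assumes rows are plain dictionaries with the `en`, `ca`, and `gender` fields; `pair_counterfactuals` is a hypothetical helper name, not part of the dataset or of any library.

```python
from collections import defaultdict


def pair_counterfactuals(rows):
    """Group rows by English source and keep only sources with both genders.

    Returns {en: {"masculine": ca_text, "feminine": ca_text}}.
    """
    by_source = defaultdict(dict)
    for row in rows:
        by_source[row["en"]][row["gender"]] = row["ca"]
    # Drop sources that lack one side of the pair.
    return {
        en: translations
        for en, translations in by_source.items()
        if {"masculine", "feminine"} <= set(translations)
    }
```

Sources with only one gender are dropped rather than raising an error, which makes it easy to count how many complete pairs a split actually contains.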
Example (`context_preceding`):
```json
{
"en": "In 1994–95, he conducted a research project for the New York Times on how to transform the print newspaper into a multimedia publication. <sep> Ritchin is a prolific author and curator, focusing on digital media and the rapid changes occurring in photography.",
"ca_target": "Ritchin és una escriptora i editora prolífica, que s'ha centrat en els mitjans digitals i en els canvis que s'han produït ràpidament en la fotografia.",
"ca_full": "Durant els anys 1994-1995, va dirigir un projecte de recerca per al \"New York Times\" sobre com transformar el diari imprès en una publicació multimèdia. <sep> Ritchin és un escriptor i editor prolífic, que s'ha centrat en els mitjans digitals i en els canvis que s'han produït ràpidament en la fotografia.",
"ca_original": "Ritchin és un escriptor i editor prolífic, que s'ha centrat en els mitjans digitals i en els canvis que s'han produït ràpidament en la fotografia.",
"gender": "masculine"
}
```
Note: in this example, `ca_target` contains the gender-flipped (feminine) translation, while `ca_original` and `ca_full` reflect the original masculine gender. The `gender` field records the **original** gender of the referent.
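To work with the two sentences separately, the `<sep>`-joined fields (`en` and `ca_full`) can be split as in the sketch below. It relies on the ordering described in this card: the context sentence comes first in `context_preceding` and second in `context_trailing`. The helper name `split_context` and the returned keys are hypothetical.

```python
SEP = "<sep>"


def split_context(text: str, config: str) -> dict:
    """Split a two-sentence field into its context and gendered sentences."""
    parts = text.split(SEP, 1)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one {SEP!r} separator")
    first, second = (part.strip() for part in parts)
    if config == "context_preceding":
        return {"context": first, "gendered": second}
    if config == "context_trailing":
        return {"gendered": first, "context": second}
    raise ValueError(f"not a context config: {config!r}")
```

Applied to the `ca_full` value in the example above with `config="context_preceding"`, the `gendered` part matches the `ca_original` value.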
### Data Fields
**`single_sentence`**
- `en` (string): English sentence — each English sentence appears twice, paired with masculine and feminine Catalan translations respectively
- `ca` (string): Catalan translation (gender matches the `gender` field)
- `gender` (string): Gender of the Catalan translation in this item (`masculine` / `feminine`)
- `split` (string): Dataset split (`dev` / `test`)
**`context_preceding` and `context_trailing`**
- `en` (string): Both English sentences, `<sep>`-separated
- `ca_target` (string): Catalan translation of the gendered sentence only, **gender-flipped** relative to the original
- `ca_original` (string): Catalan translation of the gendered sentence only, **original gender**
- `ca_full` (string): Full Catalan translation of both sentences, original gender
- `gender` (string): Original gender of the target referent (`masculine` / `feminine`)
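The field lists above can be turned into a light schema check, which is useful when rows are exported or post-processed. This is a sketch under the assumption that rows are plain dictionaries; `EXPECTED_FIELDS` and `validate_row` are hypothetical names, and the expected fields are copied from this card rather than read from the dataset files.

```python
EXPECTED_FIELDS = {
    "single_sentence": {"en", "ca", "gender", "split"},
    "context_preceding": {"en", "ca_target", "ca_original", "ca_full", "gender"},
    "context_trailing": {"en", "ca_target", "ca_original", "ca_full", "gender"},
}
GENDERS = {"masculine", "feminine"}


def validate_row(row: dict, config: str) -> list:
    """Return a list of human-readable problems; an empty list means the row looks valid."""
    problems = []
    missing = EXPECTED_FIELDS[config] - set(row)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if row.get("gender") not in GENDERS:
        problems.append(f"unexpected gender value: {row.get('gender')!r}")
    return problems
```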
### Data Splits
| Config | Split | Rows |
|---|---|---|
| `single_sentence` | dev | 1,164 |
| `single_sentence` | test | 300 |
| `context_preceding` | dev | 397 |
| `context_preceding` | test | 764 |
| `context_trailing` | dev | 397 |
| `context_trailing` | test | 764 |
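The row counts in the table can be used as a quick sanity check on a downloaded copy. The sketch below only encodes the numbers listed above; how the actual counts are obtained is left to the caller, and `count_mismatches` is a hypothetical helper name.

```python
# Row counts copied from the Data Splits table, keyed by (config, split).
EXPECTED_ROWS = {
    ("single_sentence", "dev"): 1164,
    ("single_sentence", "test"): 300,
    ("context_preceding", "dev"): 397,
    ("context_preceding", "test"): 764,
    ("context_trailing", "dev"): 397,
    ("context_trailing", "test"): 764,
}

TOTAL_ROWS = sum(EXPECTED_ROWS.values())  # 3786, consistent with the 1K<n<10K size category


def count_mismatches(actual: dict) -> dict:
    """Return {(config, split): (expected, found)} for every count that differs."""
    return {
        key: (expected, actual.get(key))
        for key, expected in EXPECTED_ROWS.items()
        if actual.get(key) != expected
    }
```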
### Configs
| Config | Description |
|---|---|
| `single_sentence` | Single gendered sentences with no surrounding context. Fields: `en`, `ca`, `gender`, `split`. |
| `context_preceding` | Gendered sentence preceded by an ungendered context sentence (original MT-GenEval ordering). Fields: `en`, `ca_target` (gender-flipped), `ca_original` (original gender), `ca_full` (full translation, original gender), `gender`. |
| `context_trailing` | Gendered sentence followed by an ungendered context sentence (extended condition). Same fields as `context_preceding`. |
## Dataset Creation
### Curation Rationale
MT-GenEval provides a well-established framework for counterfactual and contextual gender evaluation in MT. This extension makes that framework available for English→Catalan, a language pair not covered by the original dataset. Catalan presents distinct gender agreement patterns that warrant dedicated evaluation resources.
The trailing context condition was added to assess whether models use downstream context to resolve gender, a condition absent from the original benchmark.
### Source Data
#### Initial Data Collection and Normalization
The English source sentences are taken directly from the original MT-GenEval dataset (Currey et al., 2022). The Catalan target sentences were produced as follows:
- **`single_sentence` dev split**: Automatic translation using [`projecte-aina/aina-translator-es-ca`](https://huggingface.co/projecte-aina/aina-translator-es-ca), translating from the Spanish MT-GenEval sentences
- **All other splits**: Professional human translation
Sentence pairs were filtered to retain only those where the Catalan translation preserves the gender marking of the original referent. This filtering reduces split sizes relative to the English–Spanish originals.
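For the trailing context condition, items were filtered to those where the gendered sentence reads naturally in either position. Item composition may therefore differ from the preceding context splits, although the Data Splits table lists the same row counts (397 dev, 764 test) for both conditions.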
#### Who are the source language producers?
English source sentences: Amazon Science / MT-GenEval authors (Currey et al., 2022)
Catalan translations: Professional translators at the Machine Translation Group, Language Technologies Lab, Barcelona Supercomputing Center
### Annotations
#### Annotation Process
No additional annotations were applied beyond those present in the original MT-GenEval dataset (gender labels). Catalan translations were produced by professional translators and reviewed for correctness of gender marking.
#### Who are the annotators?
Machine Translation Group, Language Technologies Lab, Barcelona Supercomputing Center
### Personal and Sensitive Information
The source sentences are constructed or drawn from Wikipedia and are not associated with real individuals. No personal or sensitive information is expected to be present in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset supports the development and evaluation of gender-fair machine translation for Catalan. Making this resource publicly available lowers the barrier for researchers and developers working on bias mitigation in Catalan MT systems.
### Discussion of Biases
The dataset is designed to probe binary grammatical gender (masculine/feminine) as encoded in Catalan morphology. It does not cover non-binary gender expressions or cases where gender is unmarked. Evaluation results reflect a model's ability to propagate explicit gender cues, not a comprehensive measure of gender fairness.
The `single_sentence` dev split uses automatic translation from Spanish, which may introduce artefacts or errors not present in the human-translated splits.
### Other Known Limitations
- All splits are smaller than the corresponding English–Spanish MT-GenEval splits due to gender-marking filtering.
- The dataset targets grammatical gender agreement and is not a measure of social or representational bias more broadly.
- The trailing context condition uses the same items as the preceding context condition, filtered for natural positional fit; results across the two conditions are not directly comparable in terms of item composition.
## Additional Information
### Dataset Curators
Machine Translation Group, Language Technologies Lab, Barcelona Supercomputing Center ([langtech@bsc.es](mailto:langtech@bsc.es))
### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
This work was funded by the Ministerio para la Transformación Digital y de la Función Pública — Funded by EU – NextGenerationEU within the framework of the ALIA project.
### Licensing Information
This dataset is released under [Creative Commons Attribution Share Alike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) (CC BY-SA 3.0), following the licence of the original MT-GenEval dataset.
### Citation Information
If you use this dataset, please cite the original MT-GenEval paper:
```bibtex
@inproceedings{currey-etal-2022-mtgeneval,
title = "{MT-GenEval}: {A} Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation",
author = "Currey, Anna and
Nadejde, Maria and
Pappagari, Raghavendra and
Mayer, Mia and
Lauly, Stanislas and
Niu, Xing and
Hsu, Benjamin and
Dinu, Georgiana",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2211.01355"
}
```
### Contributions
Dataset extension and Catalan translations: Machine Translation Group, Language Technologies Lab, Barcelona Supercomputing Center
## Usage
```python
from datasets import load_dataset
# Repository ID matches the dataset page (BSC-LT/geneval_catalan).
# Single sentences (no context)
single = load_dataset("BSC-LT/geneval_catalan", "single_sentence")
# Context: gendered sentence preceded by context
preceding = load_dataset("BSC-LT/geneval_catalan", "context_preceding")
# Context: gendered sentence followed by context
trailing = load_dataset("BSC-LT/geneval_catalan", "context_trailing")
```
Provider:
BSC-LT



