yandex/wmt24-en-ru-rate
Source: Hugging Face. Updated: 2025-11-06. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/yandex/wmt24-en-ru-rate
Description:
---
license: apache-2.0
---
# RATE Framework Annotation Dataset
This repository contains the annotation data for the paper [Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems](https://aclanthology.org/2025.findings-emnlp.1203/). The dataset presents human evaluation of English-Russian translations from WMT24, annotated using our proposed RATE framework by highly qualified professional translators and linguists with academic degrees and substantial industry experience. For comparison purposes, the same content was also annotated using the traditional ESA (Error Span Annotation) approach by a broader pool of annotators with verified strong language proficiency (native Russian speakers with C1 English level).
## Dataset Description
The dataset contains three files:
- `data/rate_spans.jsonl` — RATE-style annotations performed by professional translators with linguistics or translation degrees, substantial industry experience, and verified language proficiency.
- `data/esa_spans.overlap_1.jsonl` — ESA-style annotations by a wide group of in-house annotators who are native Russian speakers with verified C1 English proficiency but without specific academic degrees or translation experience.
- `data/esa_spans.overlap_3.jsonl` — ESA-style annotations with 3x overlap (each segment annotated by 3 different annotators), see Appendix A in the original paper for details.
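
The `.jsonl` extension suggests the JSON Lines format (one JSON object per line), so the files can likely be read with the Python standard library alone. The following is a minimal sketch under that assumption. The helper name `read_jsonl` and the `FILES` mapping are hypothetical, and the paths assume the repository has been downloaded locally with its `data/` directory intact.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Assumed local paths; adjust to wherever the repository was downloaded.
FILES = {
    "rate": "data/rate_spans.jsonl",
    "esa_overlap_1": "data/esa_spans.overlap_1.jsonl",
    "esa_overlap_3": "data/esa_spans.overlap_3.jsonl",
}

if __name__ == "__main__":
    for name, path in FILES.items():
        if Path(path).exists():
            print(name, len(read_jsonl(path)), "records")
```

Reading everything into a list keeps the sketch simple. For larger files, a generator that yields one record at a time would use less memory.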
## Data Structure
All three files share the same structure with the following fields:
### Document and Translation Information
- `annotator` — Unique identifier for the annotator
- `doc_id` — Original document ID (`doc_id`) from WMT24
- `doc_num` — Document number (one-to-one correspondence with doc_id, introduced for convenience)
- `domain` — Original domain (`domain`) from WMT24
- `langs` — Original language direction (`langs`) from WMT24 (always "en-ru")
- `line_id` — Original line ID (`line_id`) from WMT24
- `segment_num` — Segment number within a document (zero-indexed)
- `speech_info` — Original speech info (`speech_info`) from WMT24
- `system` — Machine translation system used for translation
- `src` — Source text
- `tgt` — Target text (translation)
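
Because each record describes a single segment, document-level views have to be rebuilt from `doc_id` and `segment_num`. The sketch below shows one possible way to do that. The function name `group_into_documents` is hypothetical, the toy records are invented for illustration, and including `annotator` in the grouping key is an assumption made so that files with overlapping annotations do not mix several annotators' copies of the same segment.

```python
from collections import defaultdict


def group_into_documents(records):
    """Group segment records by (doc_id, system, annotator), ordered by segment_num."""
    docs = defaultdict(list)
    for rec in records:
        key = (rec["doc_id"], rec["system"], rec["annotator"])
        docs[key].append(rec)
    for segments in docs.values():
        # segment_num is documented as zero-indexed within a document.
        segments.sort(key=lambda r: r["segment_num"])
    return dict(docs)


# Toy records using the documented field names; all values are invented.
toy = [
    {"doc_id": "d1", "system": "sysA", "annotator": "a1",
     "segment_num": 1, "src": "World.", "tgt": "Мир."},
    {"doc_id": "d1", "system": "sysA", "annotator": "a1",
     "segment_num": 0, "src": "Hello.", "tgt": "Привет."},
]

docs = group_into_documents(toy)
for (doc_id, system, annotator), segments in docs.items():
    print(doc_id, system, annotator, " ".join(s["tgt"] for s in segments))
```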
### Annotation Data
#### For RATE annotations:
- `score_accuracy` — Accuracy score of the translation (1-100)
- `score_fluency` — Fluency score of the translation (1-100)
- `score_style` — Style score of the translation (1-100)
- `spans` — Error spans in the translation (see structure below)
#### For ESA annotations:
- `esa_score` — Overall translation score (1-100)
- `spans` — Error spans in the translation (see structure below)
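
As an illustration of how these score fields might be summarized, the sketch below averages them per `system`. It is a plain unweighted mean over records, not necessarily the aggregation used in the paper. The function name `mean_scores_by_system` and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

RATE_SCORE_FIELDS = ("score_accuracy", "score_fluency", "score_style")


def mean_scores_by_system(records, score_fields):
    """Average each score field over all records of each system."""
    by_system = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for field in score_fields:
            by_system[rec["system"]][field].append(rec[field])
    return {
        system: {field: mean(values) for field, values in fields.items()}
        for system, fields in by_system.items()
    }


# Toy records in the documented shape; all values are invented.
toy_rate = [
    {"system": "sysA", "score_accuracy": 90, "score_fluency": 80, "score_style": 70},
    {"system": "sysA", "score_accuracy": 70, "score_fluency": 100, "score_style": 90},
    {"system": "sysB", "score_accuracy": 50, "score_fluency": 60, "score_style": 40},
]
toy_esa = [
    {"system": "sysA", "esa_score": 85},
    {"system": "sysA", "esa_score": 95},
]

print(mean_scores_by_system(toy_rate, RATE_SCORE_FIELDS))
print(mean_scores_by_system(toy_esa, ("esa_score",)))
```

The same function serves both annotation styles because the score field names are passed in rather than hard-coded.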
### Error Spans Format
Each error span is a JSON object with the following properties:
- `start_i`, `end_i` — Start and end indices of the error span
- `error_text` — Text of the error, a substring of the full text extracted using the provided indices
- `error_source` — Source of the error:
  - "src": Error span in the source text (rarely used, typically to highlight omitted parts of the source)
  - "trn": Error span in the translation (most common)
- `error_type` — Type of error:
  - For RATE annotations: specific error types as described in the original paper
  - For ESA annotations: "ANY_TYPE" placeholder (as ESA does not specify error types)
- `error_comment` — Annotator's comment:
  - For RATE: explanatory text from the annotator
  - For ESA: empty string (as annotators were not asked to provide comments by design)
- `severity` — Error severity:
  - For RATE: integer from 1 to 5
  - For ESA: 3 (for minor errors) or 5 (for major errors)
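
The span fields described above can be sanity-checked by slicing the relevant text and comparing the result with `error_text`. The sketch below assumes that `start_i` and `end_i` are character offsets with an exclusive end, as in Python slicing. The dataset card does not state this, so a large number of mismatches on real data would suggest a different convention (for example, an inclusive end). All function names and the toy record are hypothetical.

```python
from collections import Counter


def span_text(record, span):
    """Slice the span from `src` or `tgt`, assuming a half-open [start_i, end_i) range."""
    text = record["src"] if span["error_source"] == "src" else record["tgt"]
    return text[span["start_i"]:span["end_i"]]


def mismatched_spans(record):
    """Return spans whose `error_text` differs from the slice under the assumption above."""
    return [s for s in record["spans"] if span_text(record, s) != s["error_text"]]


def severity_counts(record):
    """Count the record's spans per severity value."""
    return Counter(s["severity"] for s in record["spans"])


# Toy ESA-style record in the documented shape; all values are invented.
toy = {
    "src": "The cat sat.",
    "tgt": "Кот сидел тут.",
    "spans": [
        {"start_i": 10, "end_i": 13, "error_text": "тут", "error_source": "trn",
         "error_type": "ANY_TYPE", "error_comment": "", "severity": 3},
        {"start_i": 0, "end_i": 3, "error_text": "The", "error_source": "src",
         "error_type": "ANY_TYPE", "error_comment": "", "severity": 5},
    ],
}

print(mismatched_spans(toy))
print(severity_counts(toy))
```

For ESA records, the documented severities 3 and 5 correspond to minor and major errors, so the counter doubles as a minor/major tally.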
## Citation
If you use this dataset in your research, please cite:
```
@inproceedings{popov-etal-2025-refined,
    title = "Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems",
    author = "Popov, Dmitry and
      Negodin, Vladislav and
      Enikeeva, Ekaterina and
      Matrosova, Iana and
      Karpachev, Nikolay and
      Ryabinin, Max",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    url = "https://aclanthology.org/2025.findings-emnlp.1203/",
    pages = "22079--22095"
}
```
## License
This dataset is licensed under the Apache License, Version 2.0.
Provider:
yandex



