
yandex/wmt24-en-ru-rate

Source: Hugging Face. Updated 2025-11-06. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/yandex/wmt24-en-ru-rate
Description:
# RATE Framework Annotation Dataset

This repository contains the annotation data for the paper [Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems](https://aclanthology.org/2025.findings-emnlp.1203/).

The dataset presents human evaluation of English-Russian translations from WMT24, annotated using our proposed RATE framework by highly qualified professional translators and linguists with academic degrees and substantial industry experience. For comparison purposes, the same content was also annotated using the traditional ESA (Error Span Annotation) approach by a broader pool of annotators with verified strong language proficiency (native Russian speakers with C1 English level).

## Dataset Description

The dataset contains three files:

- `data/rate_spans.jsonl`: RATE-style annotations performed by professional translators with linguistics or translation degrees, substantial industry experience, and verified language proficiency.
- `data/esa_spans.overlap_1.jsonl`: ESA-style annotations by a wide group of in-house annotators who are native Russian speakers with verified C1 English proficiency but without specific academic degrees or translation experience.
- `data/esa_spans.overlap_3.jsonl`: ESA-style annotations with 3x overlap (each segment annotated by 3 different annotators); see Appendix A in the original paper for details.
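The three files are in JSON Lines format, so each line can be parsed on its own. A minimal sketch of reading one of them with the Python standard library, assuming the file has already been downloaded to a local `data/` directory. The helper names `read_jsonl` and `load_records` are illustrative and not part of the dataset.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)


def load_records(path):
    """Read the whole file into a list (assumes it fits in memory)."""
    return list(read_jsonl(path))
```

With a local copy in place, `records = load_records("data/rate_spans.jsonl")` would give one dictionary per annotated segment.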
## Data Structure

All three files share the same structure with the following fields:

### Document and Translation Information

- `annotator`: Unique identifier for the annotator
- `doc_id`: Original document ID (`doc_id`) from WMT24
- `doc_num`: Document number (one-to-one correspondence with `doc_id`, introduced for convenience)
- `domain`: Original domain (`domain`) from WMT24
- `langs`: Original language direction (`langs`) from WMT24 (always "en-ru")
- `line_id`: Original line ID (`line_id`) from WMT24
- `segment_num`: Segment number within a document (zero-indexed)
- `speech_info`: Original speech info (`speech_info`) from WMT24
- `system`: Machine translation system used for translation
- `src`: Source text
- `tgt`: Target text (translation)

### Annotation Data

#### For RATE annotations:

- `score_accuracy`: Accuracy score of the translation (1-100)
- `score_fluency`: Fluency score of the translation (1-100)
- `score_style`: Style score of the translation (1-100)
- `spans`: Error spans in the translation (see structure below)

#### For ESA annotations:

- `esa_score`: Overall translation score (1-100)
- `spans`: Error spans in the translation (see structure below)

### Error Spans Format

Each error span is a JSON object with the following properties:

- `start_i`, `end_i`: Start and end indices of the error span
- `error_text`: Text of the error, a substring of the full text extracted using the provided indices
- `error_source`: Source of the error:
  - "src": Error span in the source text (rarely used, typically to highlight omitted parts of the source)
  - "trn": Error span in the translation (most common)
- `error_type`: Type of error:
  - For RATE annotations: specific error types as described in the original paper
  - For ESA annotations: "ANY_TYPE" placeholder (as ESA does not specify error types)
- `error_comment`: Annotator's comment:
  - For RATE: explanatory text from the annotator
  - For ESA: empty string (by design, annotators were not asked to provide comments)
- `severity`: Error severity:
  - For RATE: integer from 1 to 5
  - For ESA: 3 (for minor errors) or 5 (for major errors)

## Citation

If you use this dataset in your research, please cite:

```
@inproceedings{popov-etal-2025-refined,
    title = "Refined Assessment for Translation Evaluation: Rethinking Machine Translation Evaluation in the Era of Human-Level Systems",
    author = "Popov, Dmitry and Negodin, Vladislav and Enikeeva, Ekaterina and Matrosova, Iana and Karpachev, Nikolay and Ryabinin, Max",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    url = "https://aclanthology.org/2025.findings-emnlp.1203/",
    pages = "22079--22095"
}
```

## License

This dataset is licensed under the Apache License, Version 2.0.
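
## Example: Working with Error Spans

As a sketch of how the span fields described under "Error Spans Format" fit together, the functions below recover the text of a span from `src` or `tgt` according to `error_source`, compare it with the stored `error_text`, and tally spans by `severity`. Several things here are assumptions to verify against real records: that `start_i` and `end_i` are character offsets, that `end_i` is exclusive (the card does not say), and that `spans` is already a list of span objects. The names `span_text`, `span_matches`, and `count_by_severity` are illustrative and not part of the dataset.

```python
from collections import Counter


def span_text(record, span):
    """Slice the span out of the source ("src") or translation ("trn") text.

    Assumption: start_i/end_i are character offsets and end_i is exclusive.
    """
    text = record["src"] if span["error_source"] == "src" else record["tgt"]
    return text[span["start_i"]:span["end_i"]]


def span_matches(record, span):
    """Return True if the sliced text equals the stored error_text."""
    return span_text(record, span) == span["error_text"]


def count_by_severity(records):
    """Count spans per severity value across all records."""
    counts = Counter()
    for record in records:
        for span in record["spans"]:
            counts[span["severity"]] += 1
    return counts
```

If `span_matches` fails on many real records, the offset assumption is probably wrong (for example, `end_i` may be inclusive), and the slice should be adjusted accordingly.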
Provider:
yandex