TransEvalnia
Source: ModelScope community (updated 2026-01-06, indexed 2025-07-26)
Download link: https://modelscope.cn/datasets/SakanaAI/TransEvalnia
# TransEvalnia dataset
Paper: [arxiv](https://arxiv.org/abs/2507.12724) | Github: [SakanaAI/TransEvalnia](https://github.com/SakanaAI/TransEvalnia)
## Introduction
**[TransEvalnia](#)** is a prompting-based translation evaluation and ranking system that reasons explicitly when evaluating and ranking translations. This repository hosts the dataset used in that work.
<img src="https://cdn-uploads.huggingface.co/production/uploads/605aebdece105fbcadcb8f3d/owp3vZDRul1406bXuEt8N.png" width="800">
The dataset consists of two parts.
The **with_human_ranking** part includes over 3,000 translation triplets with human scores. The data was mainly used for evaluating ranking accuracy.
The **human_verification** part includes 800 model-generated evaluations and their verification from human annotators. The data was mainly used for meta-evaluation.
## The *with_human_ranking* data
The **with_human_ranking** data contains over 3,000 translation triplets (src, tgt1, tgt2) from the following 7 data sources, together with reasoning-based evaluations of them generated by Qwen2.5-72B-Instruct and Claude Sonnet 3.5.
* `hard en-ja`: 47 English-Japanese translation triplets curated by expert translators
* `wmt 2021 en-ja`: 500 English-Japanese translation triplets from WMT 2021 DA
* `wmt 2021 ja-en`: 500 Japanese-English translation triplets from WMT 2021 DA
* `wmt 2022 en-ru`: 497 English-Russian translation triplets from WMT 2022 MQM
* `wmt 2023 en-de`: 500 English-German translation triplets from WMT 2023 MQM
* `wmt 2023 zh-en`: 500 Chinese-English translation triplets from WMT 2023 MQM
* `wmt 2024 en-es`: 499 English-Spanish translation triplets from WMT 2024 MQM
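As a quick check on the "over 3,000" figure, the per-source counts listed above can be summed. The dictionary below only restates those counts; the variable names are illustrative and not part of the dataset.

```python
# Triplet counts per data source, copied from the list above.
SOURCE_COUNTS = {
    "hard en-ja": 47,
    "wmt 2021 en-ja": 500,
    "wmt 2021 ja-en": 500,
    "wmt 2022 en-ru": 497,
    "wmt 2023 en-de": 500,
    "wmt 2023 zh-en": 500,
    "wmt 2024 en-es": 499,
}

total = sum(SOURCE_COUNTS.values())
print(total)  # 3043
```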
Every data item has the following fields:
- `dataset: str` - Dataset name.
- `src_text: str` - Source text.
- `tgt_texts: list[str]` - Texts of two translations.
- `src_lang: str` - Source language code (e.g. ja).
- `tgt_lang: str` - Target language code (e.g. en).
- `src_lang_long: str` - Source language full name (e.g. Japanese).
- `tgt_lang_long: str` - Target language full name (e.g. English).
- `human_scores: list[float]` - Ground-truth human scores of the two translations. In every case, a higher score indicates higher translation quality.
  - For `hard en-ja`, each score lies in the range [0, 10].
  - For datasets sourced from WMT DA, each score is the z-score of the original direct assessment rating.
  - For datasets sourced from WMT MQM, each score is the negated MQM score (MQM is an error penalty, so negating it makes higher values better).
- `one_step_ranking/{model}: str` - Model-generated ranking decision using the one-step method.
- `dim_evals/{model}: list[str]` - Model-generated dimensional evaluations, one for each of the two translations.
- `two_step_ranking/{model}: str` - Model-generated ranking decision based on `dim_evals/{model}`, using the two-step method.
- `two_step_scoring/{model}: str` - Model-generated scoring decision based on `dim_evals/{model}`, using the two-step method.
- `interleaved_dim_evals/{model}: str` - Model-generated interleaved dimensional evaluations based on `dim_evals/{model}`, using the three-step method.
- `three_step_ranking/{model}: str` - Model-generated ranking decision based on `interleaved_dim_evals/{model}`, using the three-step method.
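Because every `human_scores` variant is oriented so that higher is better, a ground-truth preference can be read off the same way for all sources. The sketch below is one way to do that, not the authors' evaluation code. `human_preference` and `ranking_accuracy` are hypothetical helpers, ties are treated as carrying no preference (an assumption), and model decisions are assumed to be already parsed into a 0-based index, since the text format of the ranking fields is not described here.

```python
from typing import Optional, Sequence


def human_preference(human_scores: Sequence[float]) -> Optional[int]:
    """Return the index (0 or 1) of the translation humans scored higher.

    Returns None on a tie (assumption: ties carry no preference).
    """
    first, second = human_scores
    if first == second:
        return None
    return 0 if first > second else 1


def ranking_accuracy(items: Sequence[dict], predictions: Sequence[int]) -> float:
    """Fraction of non-tied items where the predicted index matches humans."""
    correct = 0
    counted = 0
    for item, pred in zip(items, predictions):
        gold = human_preference(item["human_scores"])
        if gold is None:
            continue  # skip ties
        counted += 1
        correct += int(pred == gold)
    return correct / counted if counted else 0.0


# Toy items with made-up scores, only to show the call pattern.
toy_items = [
    {"human_scores": [8.0, 6.5]},    # hard en-ja style, range [0, 10]
    {"human_scores": [-0.3, 0.9]},   # WMT DA style z-scores
    {"human_scores": [-5.0, -1.0]},  # negated MQM: -1.0 is the better one
    {"human_scores": [0.0, 0.0]},    # tie, skipped
]
toy_predictions = [0, 1, 0, 1]
print(ranking_accuracy(toy_items, toy_predictions))
```

How ties should be handled is a design choice; excluding them keeps the metric from rewarding or penalizing an arbitrary pick.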
The dependencies between the fields can be visualized as follows.
```
src_text
tgt_texts
src_lang
tgt_lang
├── one_step_ranking/{model}
└── dim_evals/{model}
    ├── two_step_ranking/{model}
    ├── two_step_scoring/{model}
    └── interleaved_dim_evals/{model}
        └── three_step_ranking/{model}
```
Note: `{model}` can be `qwen` or `claude`.
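The `{model}` placeholder expands into concrete keys, and the field descriptions give the order in which the model-generated fields build on each other. A small sketch, assuming keys are formed by plain string substitution (e.g. `dim_evals/qwen`); `DEPENDS_ON`, `field_key`, and `dependency_chain` are illustrative names, not part of the dataset.

```python
from typing import Dict, List, Optional

MODELS = ("qwen", "claude")

# Parent of each model-generated field, following the field descriptions.
# None means the field depends only on the source text and translations.
DEPENDS_ON: Dict[str, Optional[str]] = {
    "one_step_ranking": None,
    "dim_evals": None,
    "two_step_ranking": "dim_evals",
    "two_step_scoring": "dim_evals",
    "interleaved_dim_evals": "dim_evals",
    "three_step_ranking": "interleaved_dim_evals",
}


def field_key(field: str, model: str) -> str:
    """Build a concrete key such as 'dim_evals/qwen'."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if field not in DEPENDS_ON:
        raise ValueError(f"unknown field: {field!r}")
    return f"{field}/{model}"


def dependency_chain(field: str) -> List[str]:
    """Fields needed to produce `field`, root first, ending with `field`."""
    chain: List[str] = []
    current: Optional[str] = field
    while current is not None:
        chain.append(current)
        current = DEPENDS_ON[current]
    return chain[::-1]


print([field_key(f, "claude") for f in dependency_chain("three_step_ranking")])
# ['dim_evals/claude', 'interleaved_dim_evals/claude', 'three_step_ranking/claude']
```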
## The *human_verification* data
The **human_verification** data contains 800 model-generated evaluations and their human verifications. The verifications were collected from two translation service vendors.
* `human_verification-generic_claude-vendor1`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 1.
* `human_verification-generic_claude-vendor2`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-generic_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-haiku_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 haiku translations. Annotated by vendor 2.
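The four subset names above follow a visible pattern: `human_verification-{domain}_{evaluator}-{vendor}`. The sketch below splits a name into those parts, which can help when selecting the right `annotations` handling per vendor. It assumes the pattern holds exactly as in the four listed names; `SubsetInfo` and `parse_subset_name` are hypothetical helpers.

```python
from typing import NamedTuple

SUBSET_NAMES = [
    "human_verification-generic_claude-vendor1",
    "human_verification-generic_claude-vendor2",
    "human_verification-generic_qwen-vendor2",
    "human_verification-haiku_qwen-vendor2",
]


class SubsetInfo(NamedTuple):
    domain: str     # "generic" or "haiku"
    evaluator: str  # model that wrote the evaluation: "claude" or "qwen"
    vendor: str     # annotating vendor: "vendor1" or "vendor2"


def parse_subset_name(name: str) -> SubsetInfo:
    """Split a subset name into (domain, evaluator, vendor)."""
    parts = name.split("-")
    if len(parts) != 3 or parts[0] != "human_verification":
        raise ValueError(f"unexpected subset name: {name!r}")
    domain, evaluator = parts[1].split("_")
    return SubsetInfo(domain, evaluator, parts[2])


for name in SUBSET_NAMES:
    print(parse_subset_name(name))
```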
Every data item has the following fields:
- `idx: int` - Data index.
- `source: str` - Source text.
- `translation: str` - Translation text.
- `evaluation: str` - Model-generated evaluation.
- `annotations: dict` - Human verification of the model-generated evaluation. The structure of this field differs between vendors.
- `system: str` - Model that was used to generate the translation.
## Citation
```
@misc{sproat2025transevalniareasoningbasedevaluationranking,
      title={TransEvalnia: Reasoning-based Evaluation and Ranking of Translations},
      author={Richard Sproat and Tianyu Zhao and Llion Jones},
      year={2025},
      eprint={2507.12724},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.12724},
}
```
Provider: maas
Created: 2025-07-19



