TransEvalnia

Name: TransEvalnia
Creator: maas
Published: 2026-01-06 16:39:20
License: No description available

ModelScope community. Updated 2026-01-06. Indexed 2025-07-26.

Download link:

https://modelscope.cn/datasets/SakanaAI/TransEvalnia


Resource description:

# TransEvalnia dataset

Paper: [arxiv](https://arxiv.org/abs/2507.12724) | GitHub: [SakanaAI/TransEvalnia](https://github.com/SakanaAI/TransEvalnia)

## Introduction

**[TransEvalnia](#)** is a prompting-based translation evaluation and ranking system that uses reasoning when performing its evaluations and rankings. This repo presents the dataset used in the work.

<img src="https://cdn-uploads.huggingface.co/production/uploads/605aebdece105fbcadcb8f3d/owp3vZDRul1406bXuEt8N.png" width="800">

The dataset consists of two parts.

The **with_human_ranking** part includes over 3,000 translation triplets with human scores. The data was mainly used for evaluating ranking accuracy.

The **human_verification** part includes 800 model-generated evaluations and their verification by human annotators. The data was mainly used for meta-evaluation.

## The *with_human_ranking* data

The **with_human_ranking** data contains over 3,000 translation triplets (src, tgt1, tgt2) from the following 7 data sources, together with their reasoning-based evaluations from Qwen2.5-72B-Instruct and Claude Sonnet 3.5.

* `hard en-ja`: 47 English-Japanese translation triplets curated by expert translators
* `wmt 2021 en-ja`: 500 English-Japanese translation triplets from WMT 2021 DA
* `wmt 2021 ja-en`: 500 Japanese-English translation triplets from WMT 2021 DA
* `wmt 2022 en-ru`: 497 English-Russian translation triplets from WMT 2022 MQM
* `wmt 2023 en-de`: 500 English-German translation triplets from WMT 2023 MQM
* `wmt 2023 zh-en`: 500 Chinese-English translation triplets from WMT 2023 MQM
* `wmt 2024 en-es`: 499 English-Spanish translation triplets from WMT 2024 MQM

Every data item has the following fields:

- `dataset: str`
  - Dataset name.
- `src_text: str`
  - Source text.
- `tgt_texts: list[str]`
  - Texts of the two translations.
- `src_lang: str`
  - Source language code (e.g. ja).
- `tgt_lang: str`
  - Target language code (e.g. en).
- `src_lang_long: str`
  - Source language full name (e.g. Japanese).
- `tgt_lang_long: str`
  - Target language full name (e.g. English).
- `human_scores: list[float]`
  - Ground-truth human scores of the two translations.
  - For `hard en-ja`, a human score lies in the range [0, 10]. A higher score indicates higher translation quality.
  - For datasets sourced from WMT DA, a human score is the z-score of the original direct assessment rating. A higher score indicates higher translation quality.
  - For datasets sourced from WMT MQM, a human score is the negated MQM score. A higher score indicates higher translation quality.
- `one_step_ranking/{model}: str`
  - Model-generated ranking decision using the one-step method.
- `dim_evals/{model}: list[str]`
  - Model-generated dimensional evaluations for each of the two translations.
- `two_step_ranking/{model}: str`
  - Model-generated ranking decision based on `dim_evals/{model}`, using the two-step method.
- `two_step_scoring/{model}: str`
  - Model-generated scoring decision based on `dim_evals/{model}`, using the two-step method.
- `interleaved_dim_evals/{model}: str`
  - Model-generated interleaved dimensional evaluations, based on `dim_evals/{model}`, using the three-step method.
- `three_step_ranking/{model}: str`
  - Model-generated ranking decision based on `interleaved_dim_evals/{model}`, using the three-step method.

The dependencies between the fields can be visualized as follows.

```
src_text tgt_texts src_lang tgt_lang
├── one_step_ranking/{model}
└── dim_evals/{model}
    ├── two_step_ranking/{model}
    ├── two_step_scoring/{model}
    └── interleaved_dim_evals/{model}
        └── three_step_ranking/{model}
```

Note: `{model}` can be `qwen` or `claude`.

## The *human_verification* data

The **human_verification** data contains 800 model-generated evaluations and their human verifications. The verifications were collected from two translation service vendors.

* `human_verification-generic_claude-vendor1`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 1.
* `human_verification-generic_claude-vendor2`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-generic_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-haiku_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 haiku translations. Annotated by vendor 2.

Every data item has the following fields:

- `idx: int`
  - Data index.
- `source: str`
  - Source text.
- `translation: str`
  - Translation text.
- `evaluation: str`
  - Model-generated evaluation.
- `annotations: dict`
  - Human verification of the model-generated evaluation. Data provided by different vendors have different structures.
- `system: str`
  - Model that was used to generate the translation.

## Citation

```
@misc{sproat2025transevalniareasoningbasedevaluationranking,
      title={TransEvalnia: Reasoning-based Evaluation and Ranking of Translations},
      author={Richard Sproat and Tianyu Zhao and Llion Jones},
      year={2025},
      eprint={2507.12724},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.12724},
}
```
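Because every `human_scores` convention described above (the [0, 10] scale, DA z-scores, negated MQM) is oriented so that a higher value means a better translation, a human preference between the two translations can be derived the same way for all seven sources. The sketch below illustrates that idea and a simple ranking-accuracy computation. It is not the authors' evaluation code: the helper names, the choice to skip ties, and the assumption that a model's free-text ranking decision has already been parsed into an index (0 or 1) are all hypothetical, and the toy items are made up for illustration.

```python
import math
from typing import Optional, Sequence


def human_preference(human_scores: Sequence[float]) -> Optional[int]:
    """Return the index (0 or 1) of the translation humans scored higher.

    Returns None when both scores are equal (no human preference).
    """
    first, second = human_scores
    if first == second:
        return None
    return 0 if first > second else 1


def ranking_accuracy(items: Sequence[dict], predicted: Sequence[int]) -> float:
    """Fraction of non-tied items whose predicted index matches the human preference.

    `predicted` holds one index per item; how it is parsed from the
    model-generated ranking strings is outside this sketch.
    """
    correct = 0
    total = 0
    for item, pred in zip(items, predicted):
        gold = human_preference(item["human_scores"])
        if gold is None:
            continue  # assumption: tied items are excluded from the accuracy
        total += 1
        correct += int(pred == gold)
    return correct / total if total else math.nan


# Made-up items that only mimic the documented `dataset` and `human_scores` fields.
toy_items = [
    {"dataset": "hard en-ja", "human_scores": [8.0, 6.5]},        # [0, 10] scale
    {"dataset": "wmt 2021 en-ja", "human_scores": [-0.31, 0.42]},  # DA z-scores
    {"dataset": "wmt 2023 en-de", "human_scores": [-5.0, -1.0]},   # negated MQM
    {"dataset": "wmt 2023 zh-en", "human_scores": [-2.0, -2.0]},   # tie
]
toy_predictions = [0, 1, 0, 1]  # hypothetical parsed model decisions

accuracy = ranking_accuracy(toy_items, toy_predictions)
print(f"ranking accuracy on toy data: {accuracy:.3f}")
```

Skipping ties is only one possible convention; counting them as always correct or always wrong would change the reported number, so the choice should be stated alongside any accuracy figure.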


Provider:

maas

Created:

2025-07-19
