
TransEvalnia

ModelScope Community | Updated: 2026-01-06 | Indexed: 2025-07-26
Download link:
https://modelscope.cn/datasets/SakanaAI/TransEvalnia
Resource description:
# TransEvalnia dataset

Paper: [arxiv](https://arxiv.org/abs/2507.12724) | Github: [SakanaAI/TransEvalnia](https://github.com/SakanaAI/TransEvalnia)

## Introduction

**[TransEvalnia](#)** is a prompting-based translation evaluation and ranking system that uses reasoning to perform its evaluations and rankings. This repo presents the dataset used in the work.

<img src="https://cdn-uploads.huggingface.co/production/uploads/605aebdece105fbcadcb8f3d/owp3vZDRul1406bXuEt8N.png" width="800">

The dataset consists of two parts.

The **with_human_ranking** part includes over 3,000 translation triplets with human scores. It was mainly used for evaluating ranking accuracy.

The **human_verification** part includes 800 model-generated evaluations and their verifications by human annotators. It was mainly used for meta-evaluation.

## The *with_human_ranking* data

The **with_human_ranking** data contains over 3,000 translation triplets (src, tgt1, tgt2) from the following 7 data sources, together with their reasoning-based evaluations from Qwen2.5-72B-Instruct and Claude Sonnet 3.5.

* `hard en-ja`: 47 English-Japanese translation triplets curated by expert translators
* `wmt 2021 en-ja`: 500 English-Japanese translation triplets from WMT 2021 DA
* `wmt 2021 ja-en`: 500 Japanese-English translation triplets from WMT 2021 DA
* `wmt 2022 en-ru`: 497 English-Russian translation triplets from WMT 2022 MQM
* `wmt 2023 en-de`: 500 English-German translation triplets from WMT 2023 MQM
* `wmt 2023 zh-en`: 500 Chinese-English translation triplets from WMT 2023 MQM
* `wmt 2024 en-es`: 499 English-Spanish translation triplets from WMT 2024 MQM

Every data item has the following fields:

- `dataset: str` - Dataset name.
- `src_text: str` - Source text.
- `tgt_texts: list[str]` - Texts of the two translations.
- `src_lang: str` - Source language code (e.g. ja).
- `tgt_lang: str` - Target language code (e.g. en).
- `src_lang_long: str` - Source language full name (e.g. Japanese).
- `tgt_lang_long: str` - Target language full name (e.g. English).
- `human_scores: list[float]` - Ground-truth human scores of the two translations.
  - For `hard en-ja`, a human score lies in the range [0, 10]. A higher score indicates higher translation quality.
  - For datasets sourced from WMT DA, a human score is the z-score of the original direct assessment rating. A higher score indicates higher translation quality.
  - For datasets sourced from WMT MQM, a human score is the negated MQM score. A higher score indicates higher translation quality.
- `one_step_ranking/{model}: str` - Model-generated ranking decision using the one-step method.
- `dim_evals/{model}: list[str]` - Model-generated dimensional evaluations for each of the two translations.
- `two_step_ranking/{model}: str` - Model-generated ranking decision based on `dim_evals/{model}`, using the two-step method.
- `two_step_scoring/{model}: str` - Model-generated scoring decision based on `dim_evals/{model}`, using the two-step method.
- `interleaved_dim_evals/{model}: str` - Model-generated interleaved dimensional evaluations, based on `dim_evals/{model}`, using the three-step method.
- `three_step_ranking/{model}: str` - Model-generated ranking decision based on `interleaved_dim_evals/{model}`, using the three-step method.

The dependencies between the fields can be visualized as follows.

```
src_text tgt_texts src_lang tgt_lang
├── one_step_ranking/{model}
└── dim_evals/{model}
    ├── two_step_ranking/{model}
    ├── two_step_scoring/{model}
    └── interleaved_dim_evals/{model}
        └── three_step_ranking/{model}
```

Note: `{model}` can be `qwen` or `claude`.

## The *human_verification* data

The **human_verification** data contains 800 model-generated evaluations and their human verifications. The verifications were collected from two translation service vendors.

* `human_verification-generic_claude-vendor1`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 1.
* `human_verification-generic_claude-vendor2`: Claude Sonnet 3.5's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-generic_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 translations from multiple domains. Annotated by vendor 2.
* `human_verification-haiku_qwen-vendor2`: Qwen2.5-72B-Instruct's evaluations of 200 haiku translations. Annotated by vendor 2.

Every data item has the following fields:

- `idx: int` - Data index.
- `source: str` - Source text.
- `translation: str` - Translation text.
- `evaluation: str` - Model-generated evaluation.
- `annotations: dict` - Human verification of the model-generated evaluation. Data provided by different vendors have different structures.
- `system: str` - Model that was used to generate the translation.

## Citation

```
@misc{sproat2025transevalniareasoningbasedevaluationranking,
      title={TransEvalnia: Reasoning-based Evaluation and Ranking of Translations},
      author={Richard Sproat and Tianyu Zhao and Llion Jones},
      year={2025},
      eprint={2507.12724},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.12724},
}
```
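## Example: ranking accuracy from `human_scores`

The description of the *with_human_ranking* data says that a higher `human_scores` value always means a better translation and that the data was mainly used to evaluate ranking accuracy. The following is a minimal sketch of that comparison, not the authors' evaluation code. It assumes each item is a plain `dict` with a two-element `human_scores` list, and that a predicted index (0 or 1) has already been extracted from a model's free-text ranking field; that extraction step is not shown because the text format of those fields is not documented here. The helper names `human_preference` and `ranking_accuracy`, and the choice to skip ties, are hypothetical.

```python
from typing import Optional, Sequence


def human_preference(human_scores: Sequence[float]) -> Optional[int]:
    """Return the index (0 or 1) of the translation with the higher human score.

    Returns None when the two scores are equal (assumption: ties are skipped).
    """
    first, second = human_scores
    if first == second:
        return None
    return 0 if first > second else 1


def ranking_accuracy(items: Sequence[dict], predictions: Sequence[int]) -> float:
    """Fraction of non-tied items where the predicted index matches the human preference."""
    correct = 0
    total = 0
    for item, predicted in zip(items, predictions):
        gold = human_preference(item["human_scores"])
        if gold is None:
            continue  # assumption: tied pairs carry no ranking signal
        total += 1
        correct += int(predicted == gold)
    return correct / total if total else 0.0


# Made-up items for illustration only; these are not rows from the dataset.
items = [
    {"human_scores": [8.0, 5.5]},    # e.g. a [0, 10] expert score
    {"human_scores": [-3.0, -1.0]},  # e.g. a negated MQM score
    {"human_scores": [0.2, 0.2]},    # tie, skipped
]
predictions = [0, 0, 1]  # hypothetical indices parsed from model output
accuracy = ranking_accuracy(items, predictions)
```

Because the three score types (expert score, DA z-score, negated MQM) are all oriented so that higher is better, the same comparison applies to every data source without per-source handling.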

Provider:
maas
Created:
2025-07-19