deja-vu-pairwise-evals
Source: ModelScope community. Updated 2025-11-27; indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/CohereLabs/deja-vu-pairwise-evals
# Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"
## Content
This data contains pairwise automatic win-rate evaluations for 2 benchmarks.
1. Outputs and judge decisions for the [m-ArenaHard](https://huggingface.co/datasets/CohereLabs/m-ArenaHard) benchmark for sampled generations (5 each) from [Aya Expanse 8B](https://huggingface.co/CohereLabs/aya-expanse-8b) and [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).
2. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and judge decisions for the [aya_human_annotated](https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite/viewer/aya_human_annotated) benchmark for sampled generations (1 each) from [Aya Expanse 8B](https://huggingface.co/CohereLabs/aya-expanse-8b) and [Gemma2 9B it](https://huggingface.co/google/gemma-2-9b-it).
Model outputs are compared in pairs, and judged by GPT4o.
For analysis and context on these evaluations, see the paper [Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation](https://arxiv.org/abs/2504.11829).
## Format
Each file is a nested dictionary keyed by language. The m-ArenaHard file is further keyed by repetition, while the roundtrip-translation file holds a list of records per language. Both files also contain meta-information about the evaluation that is the same for all languages.
Below we explain the format of each file; annotations are given in "<>":
1. `win-rate_m-arenahard_aya-expanse-8B_qwen2.5-7B_gpt4o_5repeats.json`
```
{language:
    {repeat_no:
        {"prompt": <mArenaHard prompt>,
         "CohereForAI_aya-expanse-8B": <Aya Expanse 8B generation>,
         "Qwen_Qwen2.5-7B-Instruct": <Qwen2.5 7B Instruct generation>,
         "winner": <GPT4o winner in pairwise preference evaluation, either of the two model names>
        }
    }
 "meta_information":
    {'judge': <LLM judge name incl. version>,
     'judge_prompt': <LLM judge evaluation prompt template>,
     'judge_system_prompt': <LLM judge system prompt template>,
     'vllm_decoding_configuration': <vLLM decoding configuration>,
     'vllm_version': <vLLM version>
    }
}
```
2. `win-rate_roundtrip-translated_human-annotated_aya-expanse-8B_gemma2-9b-it_gpt4o.json`
```
{language:
    [{"id": id,
      "prompt": <original aya human annotated prompt>,
      "prompt_translated_<translator>": <<translator> translated prompt into the target language>,
      "prompt_pivot_<translator>": <<translator> translated prompt into the pivot language>,
      "google_gemma-2-9b-it_completion_original": <Gemma generation for the original prompt>,
      "CohereForAI_aya_expanse-8b_completion_original": <Aya Expanse generation for the original prompt>,
      "google_gemma-2-9b-it_completion_translated_<translator>": <Gemma generation for <translator> translated prompt>,
      "CohereForAI_aya_expanse-8b_completion_translated_<translator>": <Aya Expanse generation for <translator> translated prompt>,
      "original_winner": <GPT4o winner in pairwise comparisons on original prompts>,
      "translated_<translator>_winner": <GPT4o winner in pairwise comparisons on prompts of that translator>,
    }]
 "meta_information":
    {'judge': <LLM judge name incl. version>,
     'judge_prompt': <LLM judge evaluation prompt template>,
     'judge_system_prompt': <LLM judge system prompt template>,
     'vllm_decoding_configuration': <vLLM decoding configuration>,
     'vllm_version': <vLLM version>
    }
}
```
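As a rough illustration of how these two layouts might be consumed, the sketch below tallies winners per language and measures how often the judged winner changes between original and roundtrip-translated prompts. It is a minimal sketch under stated assumptions, not the authors' analysis code: the helper names (`iter_records`, `win_counts_by_language`, `flip_rate`), the translator label `"demo"`, the winner labels `"A"`/`"B"`, and the tiny in-memory samples are all hypothetical. Because the schema does not pin down exactly what sits under each `repeat_no`, the traversal simply collects every dictionary that carries the winner key, at any depth.

```python
from collections import Counter

META_KEY = "meta_information"  # top-level key holding judge/decoding metadata


def iter_records(node, winner_key):
    """Yield every dict that has `winner_key`, at any nesting depth."""
    if isinstance(node, dict):
        if winner_key in node:
            yield node
        else:
            for child in node.values():
                yield from iter_records(child, winner_key)
    elif isinstance(node, list):
        for child in node:
            yield from iter_records(child, winner_key)


def win_counts_by_language(data, winner_key="winner"):
    """Return {language: Counter({winner_value: count})}, skipping metadata."""
    return {
        lang: Counter(rec[winner_key] for rec in iter_records(subtree, winner_key))
        for lang, subtree in data.items()
        if lang != META_KEY
    }


def flip_rate(records, translator):
    """Share of records whose winner differs between original and translated prompts."""
    key = f"translated_{translator}_winner"
    pairs = [(r["original_winner"], r[key]) for r in records
             if "original_winner" in r and key in r]
    if not pairs:
        return None  # nothing to compare for this translator
    return sum(orig != trans for orig, trans in pairs) / len(pairs)


# Hypothetical samples mirroring the two schemas; real data would come from
# json.load() on the downloaded files.
AYA, QWEN = "CohereForAI_aya-expanse-8B", "Qwen_Qwen2.5-7B-Instruct"
arena_sample = {
    "fr": {
        "0": {"prompt": "p", AYA: "a0", QWEN: "q0", "winner": AYA},
        "1": {"prompt": "p", AYA: "a1", QWEN: "q1", "winner": QWEN},
    },
    META_KEY: {"judge": "<judge name>"},
}
roundtrip_sample = {
    "fr": [
        {"id": 0, "original_winner": "A", "translated_demo_winner": "A"},
        {"id": 1, "original_winner": "A", "translated_demo_winner": "B"},
        {"id": 2, "original_winner": "B", "translated_demo_winner": "B"},
        {"id": 3, "original_winner": "B", "translated_demo_winner": "B"},
    ],
    META_KEY: {"judge": "<judge name>"},
}

arena_counts = win_counts_by_language(arena_sample)
original_counts = win_counts_by_language(roundtrip_sample, winner_key="original_winner")
demo_flip_rate = flip_rate(roundtrip_sample["fr"], "demo")
print(arena_counts, original_counts, demo_flip_rate)
```

Dividing each count by the per-language total gives a win rate. The actual translator suffixes and winner values should be read from the downloaded file rather than assumed.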
## Use
**This data may not be used for model training!**
You may use this data to conduct analyses of model differences, evaluate other judges against GPT4o, or similar inference-only experiments.
Make sure to also respect the individual licenses governing the use of outputs from the Aya, Qwen, Gemma, Google Translate, NLLB, GPT4o, and Command A models.
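For the judge-comparison use case mentioned above, one possible starting point is to measure how often another judge picks the same winner as GPT4o on the same pairs. The sketch below computes raw agreement and Cohen's kappa, a chance-corrected agreement statistic chosen here purely for illustration; it is not taken from the paper. The function name `agreement_and_kappa` and the label lists are hypothetical, and the second judge's decisions would have to be produced separately.

```python
from collections import Counter


def agreement_and_kappa(reference, candidate):
    """Return (raw agreement, Cohen's kappa) for two aligned lists of winner labels."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(reference)
    # Observed agreement: fraction of pairs where both judges pick the same winner.
    observed = sum(r == c for r, c in zip(reference, candidate)) / n
    # Expected chance agreement from each judge's marginal label frequencies.
    ref_freq, cand_freq = Counter(reference), Counter(candidate)
    expected = sum(ref_freq[label] * cand_freq[label] for label in ref_freq) / (n * n)
    if expected == 1.0:
        return observed, None  # kappa is undefined when chance agreement is total
    return observed, (observed - expected) / (1 - expected)


# Hypothetical decisions: GPT4o winners from the dataset vs. another judge's winners.
gpt4o_winners = ["A", "A", "B", "B"]
other_winners = ["A", "B", "B", "B"]
agreement, kappa = agreement_and_kappa(gpt4o_winners, other_winners)
print(agreement, kappa)
```

Raw agreement is easy to read but can look high when one model wins most comparisons, which is why a chance-corrected figure is reported alongside it.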
## Citation
If you use this data for your research, please cite our work accordingly:
```
@misc{kreutzer2025dejavumultilingualllm,
      title={D\'ej\`a Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation},
      author={Julia Kreutzer and Eleftheria Briakou and Sweta Agrawal and Marzieh Fadaee and Tom Kocmi},
      year={2025},
      eprint={2504.11829},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.11829},
}
```
Provider: maas
Created: 2025-08-01



