deja-vu-pairwise-evals

ModelScope community · Updated 2025-11-27 · Indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/CohereLabs/deja-vu-pairwise-evals
Description:
# Automatic pairwise preference evaluations for "Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation"

## Content

This data contains pairwise automatic win-rate evaluations for 2 benchmarks.

1. Outputs and judge decisions for the [m-ArenaHard](https://huggingface.co/datasets/CohereLabs/m-ArenaHard) benchmark for sampled generations (5 each) from [Aya Expanse 8B](https://huggingface.co/CohereLabs/aya-expanse-8b) and [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).
2. Original and roundtrip-translated prompts (by NLLB 3.3B, Aya Expanse 32B, Google Translate, Command A), outputs and judge decisions for the [aya_human_annotated](https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite/viewer/aya_human_annotated) benchmark for sampled generations (1 each) from [Aya Expanse 8B](https://huggingface.co/CohereLabs/aya-expanse-8b) and [Gemma2 9B it](https://huggingface.co/google/gemma-2-9b-it).

Model outputs are compared in pairs, and judged by GPT4o.
For an analysis and context of these evaluations, check out the paper [Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation](https://arxiv.org/abs/2504.11829).

## Format

The data is organized in a nested dictionary by language and repetition, and contains additional meta-information about the evaluation that is the same for all languages.

Below we explain the format for each file, annotations in "<>":

1. `win-rate_m-arenahard_aya-expanse-8B_qwen2.5-7B_gpt4o_5repeats.json`
```
{language:
    {repeat_no:
        {"prompt": <mArenaHard prompt>,
         "CohereForAI_aya-expanse-8B": <Aya Expanse 8B generation>,
         "Qwen_Qwen2.5-7B-Instruct": <Qwen2.5 7B Instruct generation>,
         "winner": <GPT4o winner in pairwise preference evaluation, either of the two model names>
        }
    }
 "meta_information":
    {'judge': <LLM judge name incl. version>,
     'judge_prompt': <LLM judge evaluation prompt template>,
     'judge_system_prompt': <LLM judge system prompt template>,
     'vllm_decoding_configuration': <vLLM decoding configuration>,
     'vllm_version': <vLLM version>
    }
}
```

2. `win-rate_roundtrip-translated_human-annotated_aya-expanse-8B_gemma2-9b-it_gpt4o.json`
```
{language:
    [{"id": id,
      "prompt": <original aya human annotated prompt>,
      "prompt_translated_<translator>": <<translator> translated prompt into the target language>,
      "prompt_pivot_<translator>": <<translator> translated prompt into the pivot language>,
      "google_gemma-2-9b-it_completion_original": <Gemma generation for the original prompt>,
      "CohereForAI_aya_expanse-8b_completion_original": <Aya Expanse generation for the original prompt>,
      "google_gemma-2-9b-it_completion_translated_<translator>": <Gemma generation for <translator> translated prompt>,
      "CohereForAI_aya_expanse-8b_completion_translated_<translator>": <Aya Expanse generation for <translator> translated prompt>,
      "original_winner": <GPT4o winner in pairwise comparisons on original prompts>,
      "translated_<translator>_winner": <GPT4o winner in pairwise comparisons on prompts of that translator>,
    }]
 "meta_information":
    {'judge': <LLM judge name incl. version>,
     'judge_prompt': <LLM judge evaluation prompt template>,
     'judge_system_prompt': <LLM judge system prompt template>,
     'vllm_decoding_configuration': <vLLM decoding configuration>,
     'vllm_version': <vLLM version>
    }
}
```

## Use

**This data may not be used for model training!**

You may use this data to conduct analyses of model differences, evaluate other judges against GPT4o, or similar inference-only experiments.

Make sure to additionally respect the individual licenses for using outputs from Aya, Qwen, Gemma, Google Translate, NLLB, GPT4o, Command A models.

## Citation

If you use this data for your research, please cite our work accordingly:

```
@misc{kreutzer2025dejavumultilingualllm,
      title={D\'ej\`a Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation},
      author={Julia Kreutzer and Eleftheria Briakou and Sweta Agrawal and Marzieh Fadaee and Kocmi Tom},
      year={2025},
      eprint={2504.11829},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.11829},
}
```

Provider:
maas
Created:
2025-08-01