P-MMEval

Name: P-MMEval
Creator: maas
Published: 2026-05-13 20:02:38
License: No description available

ModelScope community. Updated 2026-05-13; indexed 2024-11-16.

Download link:

https://modelscope.cn/datasets/Qwen/P-MMEval


Resource description:

# P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

## Introduction

We introduce P-MMEval, a multilingual benchmark covering effective fundamental and capability-specialized datasets. We extend existing benchmarks, ensuring consistent language coverage across all datasets and providing parallel samples across languages, supporting up to 10 languages from 8 language families (en, zh, ar, es, ja, ko, th, fr, pt, vi). As a result, P-MMEval facilitates a holistic assessment of multilingual capabilities and comparative analysis of cross-lingual transferability.

## Supported Languages

- Arabic
- Spanish
- French
- Japanese
- Korean
- Portuguese
- Thai
- Vietnamese
- English
- Chinese

## Supported Tasks

<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/adic-93OnhRoSIk3P2VoS.png" width="1200" />

## Main Results

Multilingual capability improves with model size for every model family except the LLaMA3.2 series: LLaMA3.2-1B and LLaMA3.2-3B exhibit poor instruction following, which leads to a higher failure rate in answer extraction. In addition, Qwen2.5 demonstrates strong multilingual performance on understanding and capability-specialized tasks, while Gemma2 excels in generation tasks. Closed-source models generally outperform open-source models.

<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/dGpAuDPT53TDHEW5wFZWk.png" width="1200" />

## Citation

Our paper is available at [this link](https://arxiv.org/pdf/2411.09116).
If you find this dataset helpful, please cite our paper as follows:

```
@misc{zhang2024pmmevalparallelmultilingualmultitask,
  title={P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs},
  author={Yidan Zhang and Yu Wan and Boyi Deng and Baosong Yang and Haoran Wei and Fei Huang and Bowen Yu and Junyang Lin and Fei Huang and Jingren Zhou},
  year={2024},
  eprint={2411.09116},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.09116},
}
```

# Usage

You can use OpenCompass to evaluate your LLMs on P-MMEval. We advise using vLLM to accelerate the evaluation (this requires vLLM to be installed):

```
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets pmmeval_gen -a vllm

# Python scripts
opencompass ./configs/eval_PMMEval.py
```
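Because the samples are parallel across languages, per-language scores can be compared directly on the same questions. The following is a minimal sketch of that comparison, not the benchmark's own scoring code: the record layout, the sample data, and the answer-extraction regex are all assumptions made for illustration, and OpenCompass's actual output format may differ. It reports accuracy together with the answer-extraction failure rate, the effect noted for the smaller LLaMA3.2 models.

```python
import re
from collections import defaultdict

# Hypothetical records: (sample_id, language, raw model output, gold answer).
# Each sample_id appears once per language because the samples are parallel.
RECORDS = [
    ("q1", "en", "The answer is B.", "B"),
    ("q1", "zh", "答案是 B", "B"),
    ("q1", "th", "I am not sure.", "B"),
    ("q2", "en", "Answer: C", "C"),
    ("q2", "zh", "答案是 A", "C"),
    ("q2", "th", "Answer: C", "C"),
]

# Assumed extraction rule: the first standalone option letter A-D.
CHOICE = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(text):
    """Return the first standalone option letter, or None if extraction fails."""
    match = CHOICE.search(text)
    return match.group(1) if match else None


def per_language_stats(records):
    """Compute accuracy and extraction failure rate for each language."""
    counts = defaultdict(lambda: {"total": 0, "correct": 0, "failed": 0})
    for _sample_id, lang, output, gold in records:
        entry = counts[lang]
        entry["total"] += 1
        pred = extract_choice(output)
        if pred is None:
            entry["failed"] += 1  # model did not follow the answer format
        elif pred == gold:
            entry["correct"] += 1
    return {
        lang: {
            "accuracy": c["correct"] / c["total"],
            "extraction_failure_rate": c["failed"] / c["total"],
        }
        for lang, c in counts.items()
    }


STATS = per_language_stats(RECORDS)
for lang, row in sorted(STATS.items()):
    print(lang, row)
```

Counting an extraction failure as incorrect, as done here, means a model that ignores the requested answer format is penalized even when its free-form reasoning is right, so it can be useful to inspect the two numbers side by side.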


Provider:

maas

Created:

2024-12-10
