P-MMEval
Source: ModelScope community · Updated: 2026-05-13 · Indexed: 2024-11-16
Download link:
https://modelscope.cn/datasets/Qwen/P-MMEval
Description:
# P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
## Introduction
We introduce P-MMEval, a multilingual benchmark covering effective fundamental and capability-specialized datasets. We extend existing benchmarks to ensure consistent language coverage across all datasets and to provide parallel samples across languages, supporting up to 10 languages from 8 language families (en, zh, ar, es, ja, ko, th, fr, pt, vi). As a result, P-MMEval facilitates a holistic assessment of multilingual capabilities and a comparative analysis of cross-lingual transferability.
## Supported Languages
- Arabic
- Spanish
- French
- Japanese
- Korean
- Portuguese
- Thai
- Vietnamese
- English
- Chinese
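
For scripts that loop over the benchmark's languages, the ten codes from the introduction can be kept in a small lookup table. This is only a convenience sketch: the names `PMMEVAL_LANGUAGES` and `language_name` are hypothetical and are not part of the dataset or of OpenCompass.

```python
# Hypothetical helper: pairs the language codes given in the introduction
# with the language names listed above. Not shipped with the dataset.
PMMEVAL_LANGUAGES = {
    "en": "English",
    "zh": "Chinese",
    "ar": "Arabic",
    "es": "Spanish",
    "ja": "Japanese",
    "ko": "Korean",
    "th": "Thai",
    "fr": "French",
    "pt": "Portuguese",
    "vi": "Vietnamese",
}


def language_name(code: str) -> str:
    """Return the language name for a code; raises KeyError if unsupported."""
    return PMMEVAL_LANGUAGES[code.lower()]


for code in sorted(PMMEVAL_LANGUAGES):
    print(code, language_name(code))
```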
## Supported Tasks
<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/adic-93OnhRoSIk3P2VoS.png" width="1200" />
## Main Results
Multilingual capability improves with model size for every model family except the LLaMA3.2 series: LLaMA3.2-1B and LLaMA3.2-3B follow instructions poorly, which leads to a higher failure rate in answer extraction. In addition, Qwen2.5 demonstrates strong multilingual performance on understanding and capability-specialized tasks, while Gemma2 excels in generation tasks. Closed-source models generally outperform open-source models.
<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/dGpAuDPT53TDHEW5wFZWk.png" width="1200" />
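
To illustrate how weak instruction following turns into answer-extraction failures, here is a minimal sketch. It assumes a multiple-choice task where the model is asked to finish with `Answer: <letter>`. The regular expression, the function names, and the sample responses are all illustrative assumptions; this is not the extraction logic used in the paper or in OpenCompass.

```python
import re
from typing import List, Optional

# Assumed answer format: "Answer: B" with a letter from A to D (illustrative only).
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter, or None when the response ignores the format."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


def extraction_failure_rate(responses: List[str]) -> float:
    """Fraction of responses from which no answer could be extracted."""
    if not responses:
        return 0.0
    failures = sum(1 for r in responses if extract_choice(r) is None)
    return failures / len(responses)


# Made-up responses: two follow the requested format, two do not.
sample_responses = [
    "The capital is Paris. Answer: B",
    "I think it is probably the second option.",
    "Answer: D",
    "Sure! Here is a poem about Paris instead.",
]
print(extraction_failure_rate(sample_responses))  # 0.5
```

A response that never states its answer in the expected format counts as a failure regardless of whether the model "knew" the answer, which is why small models with poor instruction following can score below what their size alone would suggest.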
## Citation
We've published our paper at [this link](https://arxiv.org/pdf/2411.09116). If you find this dataset helpful, please cite our paper as follows:
```
@misc{zhang2024pmmevalparallelmultilingualmultitask,
title={P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs},
author={Yidan Zhang and Yu Wan and Boyi Deng and Baosong Yang and Haoran Wei and Fei Huang and Bowen Yu and Junyang Lin and Fei Huang and Jingren Zhou},
year={2024},
eprint={2411.09116},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.09116},
}
```
## Usage
You can use OpenCompass to evaluate your LLMs on P-MMEval. We advise using vLLM to accelerate the evaluation (this requires vLLM to be installed):
```
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets pmmeval_gen -a vllm
# Python scripts
opencompass ./configs/eval_PMMEval.py
```
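
If the evaluation is launched from a Python script, the CLI call above can be assembled as an argument list. The sketch below is only a thin wrapper under stated assumptions: the function name `build_opencompass_command` is hypothetical, and it uses only the flags shown in the CLI example (`--models`, `--datasets`, `-a vllm`). It prints the command rather than running it.

```python
import shlex
from typing import List


def build_opencompass_command(
    model: str,
    dataset: str = "pmmeval_gen",
    use_vllm: bool = True,
) -> List[str]:
    """Assemble the opencompass CLI call from the Usage example as an argument list."""
    command = ["opencompass", "--models", model, "--datasets", dataset]
    if use_vllm:
        # "-a vllm" is the acceleration flag from the CLI example; needs vLLM installed.
        command += ["-a", "vllm"]
    return command


command = build_opencompass_command("hf_internlm2_5_1_8b_chat")
print(shlex.join(command))
# To launch the evaluation for real (assumes opencompass is on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting issues if a model or dataset name ever contains unusual characters.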
Provider:
maas
Created:
2024-12-10



