
P-MMEval

ModelScope community. Updated 2026-05-13. Indexed 2024-11-16.
Download link:
https://modelscope.cn/datasets/Qwen/P-MMEval
Resource description:
# P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

## Introduction

We introduce P-MMEval, a multilingual benchmark covering effective fundamental and capability-specialized datasets. We extend existing benchmarks to ensure consistent language coverage across all datasets and to provide parallel samples across languages, supporting up to 10 languages from 8 language families (en, zh, ar, es, ja, ko, th, fr, pt, vi). As a result, P-MMEval facilitates a holistic assessment of multilingual capabilities and a comparative analysis of cross-lingual transferability.

## Supported Languages

- Arabic
- Spanish
- French
- Japanese
- Korean
- Portuguese
- Thai
- Vietnamese
- English
- Chinese

## Supported Tasks

<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/adic-93OnhRoSIk3P2VoS.png" width="1200" />

## Main Results

The multilingual capabilities of all models improve with increasing model size, except for the LLaMA3.2 series: LLaMA3.2-1B and LLaMA3.2-3B exhibit poor instruction-following capabilities, which leads to a higher failure rate in answer extraction. In addition, Qwen2.5 demonstrates strong multilingual performance on understanding and capability-specialized tasks, while Gemma2 excels in generation tasks. Closed-source models generally outperform open-source models.

<img src="https://cdn-uploads.huggingface.co/production/uploads/64abba3303cd5dee2efa6ee9/dGpAuDPT53TDHEW5wFZWk.png" width="1200" />

## Citation

We've published our paper at [this link](https://arxiv.org/pdf/2411.09116). If you find this dataset helpful, please cite our paper as follows:

```
@misc{zhang2024pmmevalparallelmultilingualmultitask,
  title={P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs},
  author={Yidan Zhang and Yu Wan and Boyi Deng and Baosong Yang and Haoran Wei and Fei Huang and Bowen Yu and Junyang Lin and Fei Huang and Jingren Zhou},
  year={2024},
  eprint={2411.09116},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.09116},
}
```

## Usage

You can use OpenCompass to evaluate your LLMs on P-MMEval. We advise using vllm to accelerate the evaluation (this requires vllm to be installed):

```
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets pmmeval_gen -a vllm

# Python scripts
opencompass ./configs/eval_PMMEval.py
```
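Because P-MMEval provides parallel samples across its ten languages, per-language results can be compared item by item. The following is a minimal sketch of that idea, not the paper's scoring code: the function names, the "pairwise agreement" measure, and the toy outcomes are all illustrative assumptions, and it expects you to have already reduced each model answer to a correct/incorrect flag.

```python
from itertools import combinations

# The ten language codes listed in the introduction.
LANGS = ["en", "zh", "ar", "es", "ja", "ko", "th", "fr", "pt", "vi"]


def per_language_accuracy(results):
    """Map each language code to its share of correct answers.

    `results` maps a language code to a list of booleans, one per
    parallel sample, in the same sample order for every language.
    """
    return {lang: sum(flags) / len(flags) for lang, flags in results.items()}


def pairwise_agreement(results):
    """Share of parallel samples where two languages get the same outcome
    (both correct or both incorrect). This measure is an assumption made
    for illustration, not a metric defined by P-MMEval."""
    agreement = {}
    for a, b in combinations(sorted(results), 2):
        same = sum(x == y for x, y in zip(results[a], results[b]))
        agreement[(a, b)] = same / len(results[a])
    return agreement


# Hypothetical outcomes for four parallel samples in three languages.
toy = {
    "en": [True, True, False, True],
    "zh": [True, False, False, True],
    "th": [False, False, False, True],
}

print(per_language_accuracy(toy))
print(pairwise_agreement(toy))
```

A low agreement between two languages with similar accuracy would suggest that the model solves different items in each language, which is the kind of cross-lingual comparison that parallel samples make possible.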

Provider:
maas
Created:
2024-12-10