
Marco-Bench-MIF

Source: ModelScope community. Updated 2026-05-19; indexed 2025-11-22.
Download link: https://modelscope.cn/datasets/AIDC-AI/Marco-Bench-MIF
# Marco-Bench-MIF: A Benchmark for Multilingual Instruction-Following Evaluation

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0) [![ACL 2025](https://img.shields.io/badge/ACL-2025-blue)](https://aclanthology.org/2025.acl-long.1172/) [![arXiv](https://img.shields.io/badge/arXiv-2507.11882-b31b1b.svg)](https://arxiv.org/abs/2507.11882)

## Introduction

Marco-Bench-MIF is the first deeply localized multilingual benchmark designed to evaluate instruction-following capabilities across 30 languages. Unlike existing benchmarks that rely primarily on machine translation, Marco-Bench-MIF implements fine-grained cultural adaptations to provide more accurate assessment. Our research demonstrates that machine-translated data underestimates model performance by 7-22% in multilingual environments.

## Key Features

- **Extensive Language Coverage**: 30 languages spanning 6 major language families, including high-resource (English, Chinese, German) and low-resource languages (Yoruba, Nepali)
- **Deep Cultural Localization**: Three-step process of lexical replacement, theme transformation, and pragmatic reconstruction to ensure cultural and linguistic appropriateness
- **Diverse Constraint Types**: 541 instruction-response pairs covering single/multiple constraints, expressive/content constraints, and various instruction types
- **Comparative Dataset**: Machine-translated and culturally-localized versions available for specific languages (Arabic, Chinese, Spanish, etc.) to enable comparative research

## Dataset Access

The dataset will be available through our GitHub repository and Hugging Face:

```bash
git clone https://github.com/AIDC-AI/Marco-Bench-MIF.git
```

## Key Findings

Our benchmark evaluated 20+ LLM models and revealed:

1. Model scale strongly correlates with performance, with 70B+ models outperforming 8B models by 45-60%
2. A 25-35% performance gap exists between high-resource languages (German, Chinese) and low-resource languages (Yoruba, Nepali)
3. Significant differences between localized and machine-translated evaluations, especially for complex instructions

## Contact

For questions or suggestions, please submit a GitHub issue or contact us:

- Email: lyuchenyang.lcy@alibaba-inc.com
- Project homepage: https://github.com/AIDC-AI/Marco-Bench-MIF

## License

This dataset is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

Special thanks to all annotators and translators who participated in dataset construction and validation. This project is supported by Alibaba International Digital Commerce Group.
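
The comparison between culturally localized and machine-translated versions mentioned under Key Features could be computed along the following lines. This is only a minimal sketch: the record fields (`language`, `version`, `passed`), the version labels, and the toy records are all hypothetical and do not reflect the dataset's actual schema or any reported results. It also assumes the gap is measured as an absolute difference in pass rate, which the description above does not specify.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(language, version): fraction of records with passed=True}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for rec in records:
        key = (rec["language"], rec["version"])
        totals[key] += 1
        passes[key] += int(rec["passed"])
    return {key: passes[key] / totals[key] for key in totals}


def localization_gap(records):
    """Return {language: localized accuracy minus machine-translated accuracy}.

    Languages missing either version are skipped.
    """
    acc = accuracy_by_group(records)
    languages = {lang for lang, _ in acc}
    return {
        lang: acc[(lang, "localized")] - acc[(lang, "machine_translated")]
        for lang in sorted(languages)
        if (lang, "localized") in acc and (lang, "machine_translated") in acc
    }


# Toy records, made up for illustration only (hypothetical schema, not benchmark results).
toy_records = [
    {"language": "es", "version": "localized", "passed": True},
    {"language": "es", "version": "localized", "passed": True},
    {"language": "es", "version": "machine_translated", "passed": True},
    {"language": "es", "version": "machine_translated", "passed": False},
    {"language": "ar", "version": "localized", "passed": True},
    {"language": "ar", "version": "localized", "passed": False},
    {"language": "ar", "version": "machine_translated", "passed": False},
    {"language": "ar", "version": "machine_translated", "passed": False},
]

gaps = localization_gap(toy_records)
for lang, gap in gaps.items():
    print(f"{lang}: localized minus machine-translated = {gap:+.2f}")
```

A positive value means the model scored higher on the localized version than on the machine-translated one for that language. A relative gap (dividing by the machine-translated accuracy) would be an equally plausible reading and needs a guard against division by zero.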

Provider: maas
Created: 2025-10-27