Marco-Bench-MIF
Source: ModelScope community (updated 2026-05-19, indexed 2025-11-22)
Download link: https://modelscope.cn/datasets/AIDC-AI/Marco-Bench-MIF
# Marco-Bench-MIF: A Benchmark for Multilingual Instruction-Following Evaluation
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[ACL Anthology](https://aclanthology.org/2025.acl-long.1172/)
[arXiv:2507.11882](https://arxiv.org/abs/2507.11882)
## Introduction
Marco-Bench-MIF is the first deeply localized multilingual benchmark designed to evaluate instruction-following capabilities across 30 languages. Unlike existing benchmarks that rely primarily on machine translation, Marco-Bench-MIF applies fine-grained cultural adaptations to provide a more accurate assessment. Our research shows that machine-translated data underestimates model performance by 7-22% in multilingual settings.
## Key Features
- **Extensive Language Coverage**: 30 languages spanning 6 major language families, including high-resource languages (English, Chinese, German) and low-resource languages (Yoruba, Nepali)
- **Deep Cultural Localization**: Three-step process of lexical replacement, theme transformation, and pragmatic reconstruction to ensure cultural and linguistic appropriateness
- **Diverse Constraint Types**: 541 instruction-response pairs covering single/multiple constraints, expressive/content constraints, and various instruction types
- **Comparative Dataset**: Both machine-translated and culturally localized versions are available for specific languages (Arabic, Chinese, Spanish, etc.) to enable comparative research
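
To make the idea of single versus multiple constraints concrete, here is a minimal sketch of rule-based constraint checking. It is not the benchmark's own checker, which this README does not show. The helper names (`max_words`, `contains_keyword`, `no_commas`, `check_response`) and the rules themselves are hypothetical, chosen only to illustrate how one response can be tested against several verifiable constraints at once.

```python
from typing import Callable, Dict, List

Checker = Callable[[str], bool]


def max_words(limit: int) -> Checker:
    """Hypothetical length constraint: at most `limit` whitespace-separated words.

    Whitespace splitting is an assumption that fails for languages written
    without spaces (for example Chinese), which is one reason a multilingual
    benchmark needs language-aware checks.
    """
    return lambda response: len(response.split()) <= limit


def contains_keyword(keyword: str) -> Checker:
    """Hypothetical content constraint: the response must mention `keyword`."""
    return lambda response: keyword.lower() in response.lower()


def no_commas(response: str) -> bool:
    """Hypothetical expressive constraint covering Latin, fullwidth and Arabic commas."""
    return not any(ch in response for ch in ",,،")


def check_response(response: str, constraints: List[Checker]) -> Dict[str, object]:
    """Evaluate every constraint and report whether all of them were followed."""
    results = [constraint(response) for constraint in constraints]
    return {"per_constraint": results, "all_followed": all(results)}


if __name__ == "__main__":
    report = check_response(
        "Lagos is a busy city",
        [max_words(10), contains_keyword("lagos"), no_commas],
    )
    print(report)
```

A single-constraint instruction corresponds to a list with one checker; a multiple-constraint instruction passes only when every checker in the list returns `True`.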
## Dataset Access
The dataset will be available through our GitHub repository and Hugging Face:
```bash
git clone https://github.com/AIDC-AI/Marco-Bench-MIF.git
```
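
Once the repository is cloned, the data files can be inspected with a few lines of Python. The sketch below assumes the data is stored as JSON Lines (`.jsonl`) somewhere under the clone; this README does not state the file layout or record schema, so treat both the extension and the `Marco-Bench-MIF` directory name as assumptions and adjust them to what the repository actually contains.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def find_data_files(repo_root: str) -> List[Path]:
    """List every .jsonl file under the clone, sorted for reproducibility."""
    return sorted(Path(repo_root).rglob("*.jsonl"))


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed clone location; prints nothing if no .jsonl files are found.
    for data_file in find_data_files("Marco-Bench-MIF"):
        records = list(read_jsonl(data_file))
        print(data_file, len(records))
```

Reading with an explicit `utf-8` encoding matters here because the benchmark covers many non-Latin scripts.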
## Key Findings
We evaluated more than 20 LLMs on the benchmark and found:
1. Model scale strongly correlates with performance, with 70B+ models outperforming 8B models by 45-60%
2. A 25-35% performance gap exists between high-resource languages (German, Chinese) and low-resource languages (Yoruba, Nepali)
3. Localized and machine-translated evaluations yield significantly different results, especially for complex instructions
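
The percentages above can be read either as absolute differences in accuracy points or as differences relative to the stronger score; this README does not say which. The sketch below computes both for a pair of scores so the two readings can be compared. The function names and all numeric values are hypothetical placeholders, not results from the benchmark.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def score_gap(reference: float, other: float) -> Tuple[float, float]:
    """Return (absolute gap, gap relative to the reference score)."""
    absolute = reference - other
    relative = absolute / reference if reference else 0.0
    return absolute, relative


def group_mean(scores: Dict[str, float], languages: Sequence[str]) -> float:
    """Average the accuracy over a group of language codes."""
    return mean(scores[lang] for lang in languages)


if __name__ == "__main__":
    # Placeholder accuracies, invented purely for illustration.
    scores = {"de": 0.80, "zh": 0.78, "yo": 0.50, "ne": 0.52}
    high = group_mean(scores, ["de", "zh"])
    low = group_mean(scores, ["yo", "ne"])
    absolute, relative = score_gap(high, low)
    print(f"absolute gap: {absolute:.2f}, relative gap: {relative:.1%}")
```

The same `score_gap` call applies to a localized score versus a machine-translated score for one language, which is the comparison the third finding describes.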
## Contact
For questions or suggestions, please submit a GitHub issue or contact us:
- Email: lyuchenyang.lcy@alibaba-inc.com
- Project homepage: https://github.com/AIDC-AI/Marco-Bench-MIF
## License
This dataset is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
## Acknowledgments
Special thanks to all annotators and translators who participated in dataset construction and validation. This project is supported by Alibaba International Digital Commerce Group.
Provider: maas
Created: 2025-10-27



