MMAU-Pro
Hosted on the ModelScope community (updated 2026-01-06, indexed 2025-09-13).
Download link: https://modelscope.cn/datasets/gamma-lab-umd/MMAU-Pro
# MMAU-Pro: A Challenging and Comprehensive Benchmark for Audio General Intelligence
[Paper (PDF)](https://www.arxiv.org/pdf/2508.13992) | [data.zip](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro/blob/main/data.zip)
[MMAU-Pro](https://arxiv.org/abs/2508.13992) is the most comprehensive benchmark to date for evaluating **audio intelligence in multimodal models**. It spans speech, environmental sounds, music, and their combinations, covering **49 distinct perceptual and reasoning skills**.
The dataset contains **5,305 expert-annotated question–answer pairs**, with audio clips sourced directly *from the wild*. It introduces several challenges overlooked by prior benchmarks, including:
- Long-form audio understanding (up to 10 minutes)
- Multi-audio reasoning
- Spatial audio perception
- Multicultural music reasoning
- Voice-based STEM and world-knowledge QA
- Instruction-following with verifiable constraints
- Open-ended QA in addition to MCQs
---
## 🚀 Usage
You can load the dataset with the Hugging Face `datasets` library:
```python
from datasets import load_dataset
ds = load_dataset("gamma-lab-umd/MMAU-Pro")
```
For evaluation, we provide:
- MCQ scoring via embedding similarity (NV-Embed-v2)
- Open-ended QA with LLM-as-a-judge
- Regex-based string matching for instruction following
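
As a rough illustration of the first and third scoring modes, the sketch below picks the answer choice closest to a free-form model output and checks an output against a regex constraint. It is not the official evaluation code: `toy_embed` is a bag-of-words stand-in for NV-Embed-v2, and the function names and the three-word constraint are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (NOT NV-Embed-v2)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pick_choice(model_output, choices, embed=toy_embed):
    """Return the choice whose embedding is most similar to the model output."""
    out_vec = embed(model_output)
    scores = [cosine(out_vec, embed(choice)) for choice in choices]
    best = max(range(len(choices)), key=scores.__getitem__)
    return choices[best]

def mcq_correct(model_output, choices, answer, embed=toy_embed):
    """An MCQ prediction counts as correct if the nearest choice is the answer."""
    return pick_choice(model_output, choices, embed) == answer

# Hypothetical verifiable constraint: the response must be exactly three words.
THREE_WORDS = r"(?:\S+\s+){2}\S+"

def follows_constraint(model_output, pattern):
    """Regex check for instruction following: the whole output must match."""
    return re.fullmatch(pattern, model_output.strip()) is not None
```

Mapping free-form outputs to the nearest choice avoids brittle parsing of option letters; swapping in a real embedding model only requires passing a different `embed` callable that returns vectors compatible with the similarity function.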
---
## 🧪 Baselines & Model Performance
We benchmarked 22 leading models on MMAU-Pro.
- Gemini 2.5 Flash (closed-source): 59.2% avg. accuracy
- Audio Flamingo 3 (open-source): 51.7%
- Qwen2.5-Omni-7B: 52.2%
- Humans: ~78%
See full results in the paper.
---
## 🌍 Multicultural Music Coverage
MMAU-Pro includes music from 8 diverse regions:
- Western, Chinese, Indian, European, African, Latin American, Middle Eastern, Other Asian

Results across these regions reveal clear biases: models perform well on Western and Chinese music but poorly on Indian and Latin American music.
---
## 📥 Download
- Dataset: [Hugging Face](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)
- Paper: [MMAU-Pro](https://arxiv.org/abs/2508.13992)
- Website: [Official Page](https://sonalkum.github.io/mmau-pro/)
- GitHub: [MMAUPro](https://github.com/sonalkum/MMAUPro)
---
## 🧩 Evaluation
The evaluation code expects the complete `test.parquet` file, with predictions stored in the column `model_output`.
```bash
python evaluate_mmau_pro_comprehensive.py test.parquet --model_output_column model_output
```
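
One possible way to prepare such a file is sketched below. It assumes pandas with a parquet engine (such as pyarrow) is installed; `attach_predictions`, `annotate_parquet`, and the `predict` callable are hypothetical helpers, not part of the MMAU-Pro repository, and the sketch makes no assumption about which columns `test.parquet` contains beyond passing each row to `predict`.

```python
def attach_predictions(rows, predict, column="model_output"):
    """Return copies of `rows` (dicts) with `predict(row)` stored under `column`."""
    return [{**row, column: str(predict(row))} for row in rows]

def annotate_parquet(in_path, out_path, predict, column="model_output"):
    """Read a parquet file, add the prediction column, and write the result."""
    # Imported lazily: requires pandas plus a parquet engine such as pyarrow.
    import pandas as pd

    df = pd.read_parquet(in_path)
    rows = attach_predictions(df.to_dict(orient="records"), predict, column)
    df[column] = [row[column] for row in rows]  # all original rows and columns are kept
    df.to_parquet(out_path, index=False)
```

For example, `annotate_parquet("test.parquet", "test_with_preds.parquet", my_model)` would write a copy of the test set with one prediction per row, where `my_model` stands for any function that maps a row dict to a string. The resulting file would then be passed to the evaluation script in place of `test.parquet`.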
---
## ✍️ Citation
If you use MMAU-Pro, please cite:
```bibtex
@article{kumar2025mmau,
title={MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence},
author={Kumar, Sonal and Sedl{\'a}{\v{c}}ek, {\v{S}}imon and Lokegaonkar, Vaibhavi and L{\'o}pez, Fernando and Yu, Wenyi and Anand, Nishit and Ryu, Hyeonggon and Chen, Lichang and Pli{\v{c}}ka, Maxim and Hlav{\'a}{\v{c}}ek, Miroslav and others},
journal={arXiv preprint arXiv:2508.13992},
year={2025}
}
```
---
## 🙏 Acknowledgments
Some work was carried out at JSALT 2025.
Provider: maas
Created: 2025-09-04



