MMAU-Pro

Name: MMAU-Pro
Creator: maas
Published: 2026-01-06 16:45:26
License: No description provided

ModelScope community. Updated: 2026-01-06. Indexed: 2025-09-13.

Download link:

https://modelscope.cn/datasets/gamma-lab-umd/MMAU-Pro

Description:

# MMAU-Pro: A Challenging and Comprehensive Benchmark for Audio General Intelligence

[![Paper](https://img.shields.io/badge/arxiv-%20PDF-red)](https://www.arxiv.org/pdf/2508.13992) [![Audios](https://img.shields.io/badge/🔈%20-Audios-blue)](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro/blob/main/data.zip)

[MMAU-Pro](https://arxiv.org/abs/2508.13992) is the most comprehensive benchmark to date for evaluating **audio intelligence in multimodal models**. It spans speech, environmental sounds, music, and their combinations, covering **49 distinct perceptual and reasoning skills**. The dataset contains **5,305 expert-annotated question–answer pairs**, with audios sourced directly *from the wild*.

It introduces several novel challenges overlooked by prior benchmarks, including:

- Long-form audio understanding (up to 10 minutes)
- Multi-audio reasoning
- Spatial audio perception
- Multicultural music reasoning
- Voice-based STEM and world-knowledge QA
- Instruction-following with verifiable constraints
- Open-ended QA in addition to MCQs

---

## 🚀 Usage

You can load the dataset via Hugging Face datasets:

```python
from datasets import load_dataset

ds = load_dataset("gamma-lab-umd/MMAU-Pro")
```

For evaluation, we provide:

- MCQ scoring via embedding similarity (NV-Embed-v2)
- Open-ended QA with LLM-as-a-judge
- Regex-based string matching for Instruction Following

---

## 🧪 Baselines & Model Performance

We benchmarked 22 leading models on MMAU-Pro.

- Gemini 2.5 Flash (closed-source): 59.2% avg. accuracy
- Audio Flamingo 3 (open-source): 51.7%
- Qwen2.5-Omni-7B: 52.2%
- Humans: ~78%

See full results in the paper.

---

## 🌍 Multicultural Music Coverage

MMAU-Pro includes music from 8 diverse regions: Western, Chinese, Indian, European, African, Latin American, Middle Eastern, and Other Asian.

This reveals clear biases: models perform well on Western/Chinese but poorly on Indian/Latin American music.
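
The embedding-similarity MCQ scoring listed under Usage can be illustrated with a rough sketch: the model's free-form output is compared against each answer choice, the closest choice is taken as the prediction, and accuracy is the fraction of predictions that equal the reference answer. This is not the official scorer. The bag-of-words `toy_embed` function is only a stand-in for NV-Embed-v2, and the `choices` and `answer` field names are hypothetical (only `model_output` appears in the README).

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase bag-of-words counts (NOT NV-Embed-v2)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_choice(model_output: str, choices: list[str]) -> str:
    """Return the answer choice most similar to the model's free-form output."""
    out_vec = toy_embed(model_output)
    return max(choices, key=lambda c: cosine(out_vec, toy_embed(c)))


def mcq_accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose matched choice equals the reference answer."""
    if not examples:
        return 0.0
    correct = sum(
        match_choice(ex["model_output"], ex["choices"]) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)


# Hypothetical records; "choices" and "answer" are assumed field names.
examples = [
    {
        "model_output": "I think it is a dog barking in the background.",
        "choices": ["A dog barking", "A car horn", "A door closing"],
        "answer": "A dog barking",
    },
    {
        "model_output": "The instrument sounds like a violin.",
        "choices": ["Piano", "Violin", "Drums"],
        "answer": "Piano",
    },
]

print(mcq_accuracy(examples))  # 0.5: the second output matches "Violin", not "Piano"
```

Matching on similarity rather than exact strings lets a verbose answer such as "I think it is a dog barking" still map to the choice "A dog barking". A real setup would replace `toy_embed` with a dense sentence-embedding model.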

---

## 📥 Download

- Dataset: [HF](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)
- Paper: [MMAU-Pro](https://arxiv.org/abs/2508.13992)
- Website: [Official Page](https://sonalkum.github.io/mmau-pro/)
- Github: [Git](https://github.com/sonalkum/MMAUPro)

---

## 🧩 Evaluation

The evaluation code takes the complete `test.parquet` as input, with model predictions stored in the `model_output` column.

```bash
python evaluate_mmau_pro_comprehensive.py test.parquet --model_output_column model_output
```

---

## ✍️ Citation

If you use MMAU-Pro, please cite:

```bibtex
@article{kumar2025mmau,
  title={MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence},
  author={Kumar, Sonal and Sedl{\'a}{\v{c}}ek, {\v{S}}imon and Lokegaonkar, Vaibhavi and L{\'o}pez, Fernando and Yu, Wenyi and Anand, Nishit and Ryu, Hyeonggon and Chen, Lichang and Pli{\v{c}}ka, Maxim and Hlav{\'a}{\v{c}}ek, Miroslav and others},
  journal={arXiv preprint arXiv:2508.13992},
  year={2025}
}
```

---

## 🙏 Acknowledgments

Some work was carried out at JSALT 2025.

Provider: maas

Created: 2025-09-04
