MMAU-Pro

Source: ModelScope community. Updated 2026-01-06; indexed 2025-09-13.
Download link: https://modelscope.cn/datasets/gamma-lab-umd/MMAU-Pro
# MMAU-Pro: A Challenging and Comprehensive Benchmark for Audio General Intelligence

[![Paper](https://img.shields.io/badge/arxiv-%20PDF-red)](https://www.arxiv.org/pdf/2508.13992) [![Audios](https://img.shields.io/badge/🔈%20-Audios-blue)](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro/blob/main/data.zip)

[MMAU-Pro](https://arxiv.org/abs/2508.13992) is the most comprehensive benchmark to date for evaluating **audio intelligence in multimodal models**. It spans speech, environmental sounds, music, and their combinations, covering **49 distinct perceptual and reasoning skills**.

The dataset contains **5,305 expert-annotated question–answer pairs**, with audios sourced directly *from the wild*. It introduces several novel challenges overlooked by prior benchmarks, including:

- Long-form audio understanding (up to 10 minutes)
- Multi-audio reasoning
- Spatial audio perception
- Multicultural music reasoning
- Voice-based STEM and world-knowledge QA
- Instruction-following with verifiable constraints
- Open-ended QA in addition to MCQs

---

## 🚀 Usage

You can load the dataset via Hugging Face `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("gamma-lab-umd/MMAU-Pro")
```

For evaluation, we provide:

- MCQ scoring via embedding similarity (NV-Embed-v2)
- Open-ended QA with LLM-as-a-judge
- Regex-based string matching for instruction following

---

## 🧪 Baselines & Model Performance

We benchmarked 22 leading models on MMAU-Pro:

- Gemini 2.5 Flash (closed-source): 59.2% avg. accuracy
- Audio Flamingo 3 (open-source): 51.7%
- Qwen2.5-Omni-7B: 52.2%
- Humans: ~78%

See full results in the paper.

---

## 🌍 Multicultural Music Coverage

MMAU-Pro includes music from 8 diverse regions: Western, Chinese, Indian, European, African, Latin American, Middle Eastern, and Other Asian.

Results across these regions reveal clear biases: models perform well on Western and Chinese music but poorly on Indian and Latin American music.
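To make the idea of MCQ scoring via embedding similarity (listed under Usage) more concrete, here is a minimal sketch. It is not the official scorer: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as NV-Embed-v2, and the function names and the `(model_output, choices, gold_answer)` layout are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in embedding: lowercase bag-of-words counts.

    A real setup would call an embedding model (the README names NV-Embed-v2).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_choice(model_output, choices, embed=toy_embed):
    """Return the choice whose embedding is closest to the free-form output."""
    out_vec = embed(model_output)
    return max(choices, key=lambda c: cosine(out_vec, embed(c)))


def mcq_accuracy(examples, embed=toy_embed):
    """Fraction of (model_output, choices, gold_answer) triples scored correct."""
    examples = list(examples)
    hits = sum(pick_choice(out, ch, embed) == gold for out, ch, gold in examples)
    return hits / len(examples) if examples else 0.0
```

Mapping a free-form answer to its nearest option avoids brittle exact-match parsing of the model's text. Passing a different `embed` callable swaps in a stronger embedding without changing the scoring logic.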
---

## 📥 Download

- Dataset: [HF](https://huggingface.co/datasets/gamma-lab-umd/MMAU-Pro)
- Paper: [MMAU-Pro](https://arxiv.org/abs/2508.13992)
- Website: [Official Page](https://sonalkum.github.io/mmau-pro/)
- GitHub: [Git](https://github.com/sonalkum/MMAUPro)

---

## 🧩 Evaluation

The evaluation code takes the complete `test.parquet` file, with model predictions stored in the column `model_output`:

```bash
python evaluate_mmau_pro_comprehensive.py test.parquet --model_output_column model_output
```

---

## ✍️ Citation

If you use MMAU-Pro, please cite:

```bibtex
@article{kumar2025mmau,
  title={MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence},
  author={Kumar, Sonal and Sedl{\'a}{\v{c}}ek, {\v{S}}imon and Lokegaonkar, Vaibhavi and L{\'o}pez, Fernando and Yu, Wenyi and Anand, Nishit and Ryu, Hyeonggon and Chen, Lichang and Pli{\v{c}}ka, Maxim and Hlav{\'a}{\v{c}}ek, Miroslav and others},
  journal={arXiv preprint arXiv:2508.13992},
  year={2025}
}
```

---

## 🙏 Acknowledgments

Some work was carried out at JSALT 2025.
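As a supplement to the Evaluation section, the sketch below shows one way to guard against a mismatched predictions column before launching the script. Only the script name and the `--model_output_column` flag come from this README; `build_eval_command` and `missing_predictions` are hypothetical helpers, not part of the MMAU-Pro repository, and the row format (a sequence of dict-like records) is an assumption.

```python
import sys

# Script name as given in the README; assumed to sit in the working directory.
EVAL_SCRIPT = "evaluate_mmau_pro_comprehensive.py"


def build_eval_command(parquet_path="test.parquet", output_column="model_output"):
    """Assemble the evaluation command as an argument list for subprocess.run."""
    if not output_column.strip():
        raise ValueError("output_column must be a non-empty column name")
    return [
        sys.executable,
        EVAL_SCRIPT,
        parquet_path,
        "--model_output_column",
        output_column,
    ]


def missing_predictions(rows, output_column="model_output"):
    """Return indices of rows whose prediction is absent or blank.

    `rows` is any sequence of dict-like records (hypothetical format), e.g.
    the rows of the parquet file after loading it with a library of choice.
    """
    return [
        i
        for i, row in enumerate(rows)
        if not str(row.get(output_column) or "").strip()
    ]
```

If `missing_predictions(rows)` returns an empty list, the command can be launched with `subprocess.run(build_eval_command(), check=True)` from the directory that contains the evaluation script.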

Provider: maas
Created: 2025-09-04