lihVerma/MAPLE-bench

Name: lihVerma/MAPLE-bench
Creator: lihVerma
Published: 2026-04-17 15:22:58
License: cc-by-nc-nd-4.0

Source: Hugging Face. Updated: 2026-04-17. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/lihVerma/MAPLE-bench


Description:

---
language: en
license: cc-by-nc-nd-4.0
task_categories:
- visual-question-answering
tags:
- multimodal
- benchmark
- QA
- captioning
- video-audio-subtitle
- modality-aware
- reinforcement-learning
- grpo
size_categories:
- 1K<n<10K
- 10K<n<100K
---

# MAPLE Benchmark Test Splits

This repository contains the **test splits** for the MAPLE benchmark introduced in the paper *MAPLE: Modality-Aware Post-training and Learning Ecosystem* (https://arxiv.org/pdf/2602.11596).

The benchmark is designed for modality-aware multimodal evaluation under different required-signal settings: each sample is annotated with the minimal modality subset needed to solve the task.

## Dataset Overview

MAPLE-bench evaluates multimodal reasoning across **video, audio, and subtitles** with explicit modality requirements. The benchmark supports two tasks:

- **MAPLE-QA**: multiple-choice question answering with verifiable answers.
- **MAPLE-Caption**: open-ended caption generation.

The benchmark is built to study performance under modality-exact, modality-superset, and modality-deficit conditions, and to distinguish true reasoning ability from failures caused by missing or extra signals.

## Shared Test Splits Included

This dataset card provides the manually verified test splits shared with the paper:

- `MAPLE_caption`: **5,348** test samples.
- `MAPLE_QA_exact`: **5,001** test samples.
- `MAPLE_QA_all_combinations`: **34,048** test samples for the extended QA ablation dataset.

All data splits were manually verified and are, to the best of our knowledge, faithful to the paper's annotations.

## Data Format

### MAPLE_caption

Captioning samples are modality-tagged examples intended for open-ended generation. Each item corresponds to a specific modality condition and contains the metadata needed to evaluate caption quality under the available inputs.

The video IDs refer to the [VAST-omni dataset](https://proceedings.neurips.cc/paper_files/paper/2023/file/e6b2b48b5ed90d07c305932729927781-Paper-Conference.pdf).

### MAPLE_QA_exact

This split contains the standard QA benchmark test set. Each item is a multiple-choice question with four options and a single gold answer. The answer is designed to be verifiable under the required modality subset.

The video IDs refer to the [Daily-omni dataset](https://arxiv.org/abs/2505.17862).

### MAPLE_QA_all_combinations

This extended QA split expands the benchmark into modality-exact, modality-superset, and modality-deficit settings. It is intended for ablation and robustness evaluation across all combinations of available modalities.

The video IDs refer to the [Daily-omni dataset](https://arxiv.org/abs/2505.17862).

## Benchmark Purpose

MAPLE is intended to measure whether multimodal models can:

- learn which modalities are necessary for a task,
- avoid over-relying on irrelevant modalities,
- remain accurate under partial-signal conditions,
- abstain when information is insufficient.

The paper motivates this through a modality-aware post-training framework, but this repository only releases the benchmark test data for evaluation.

## Usage Notes

- Use the test splits only for evaluation.
- For QA, the expected output is a single multiple-choice option.
- For captioning, predictions should be compared against the reference caption for the corresponding modality tag.
- All prompts used to create the dataset or to evaluate performance are listed in the paper's appendix.

## Citation

If you use this benchmark, please cite the associated paper:

```bibtex
@article{maple2026,
  title={MAPLE: Modality-Aware Post-training and Learning Ecosystem},
  author={Verma, Nikhil and Kim, Minjung and Yoo, JooYoung and Jin, Kyung-Min and Bharadwaj, Manasa and Ferreira, Kevin and Kim, Ko Keun and Kim, Youngjoon},
  journal={arXiv preprint arXiv:2602.11596},
  year={2026},
  url={https://arxiv.org/pdf/2602.11596}
}
```
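## Example: Grouping QA Accuracy by Modality Condition

The modality-exact, modality-superset, and modality-deficit settings described in this card can be illustrated with a small sketch. This is not the paper's evaluation code. It assumes that each evaluated record carries a set of required modalities, a set of modalities actually given to the model, a predicted option, and a gold answer. The field names `required`, `available`, `prediction`, and `answer` are hypothetical and are not taken from the released files. The rule that any input missing a required modality counts as "deficit" (even if it also contains extra modalities) is likewise an assumption.

```python
from itertools import combinations

# The three modalities named in the dataset card.
MODALITIES = ("video", "audio", "subtitle")


def all_modality_subsets(modalities=MODALITIES):
    """Return every non-empty subset of the modalities as a frozenset."""
    return [
        frozenset(combo)
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


def classify_condition(required, available):
    """Label an input condition relative to the required modality subset.

    Assumed rule (not confirmed by the dataset card):
      exact    -> available equals required
      superset -> available strictly contains required
      deficit  -> at least one required modality is missing
    """
    required, available = frozenset(required), frozenset(available)
    if available == required:
        return "exact"
    if required < available:  # strict subset test on frozensets
        return "superset"
    return "deficit"


def accuracy_by_condition(records):
    """Compute multiple-choice accuracy separately for each condition.

    Each record is a dict with the hypothetical keys
    'required', 'available', 'prediction', and 'answer'.
    """
    totals, correct = {}, {}
    for rec in records:
        cond = classify_condition(rec["required"], rec["available"])
        totals[cond] = totals.get(cond, 0) + 1
        correct[cond] = correct.get(cond, 0) + int(rec["prediction"] == rec["answer"])
    return {cond: correct[cond] / totals[cond] for cond in totals}


# Toy records for illustration only; they are not drawn from the benchmark.
toy_records = [
    {"required": {"audio"}, "available": {"audio"}, "prediction": "B", "answer": "B"},
    {"required": {"audio"}, "available": {"audio", "video"}, "prediction": "A", "answer": "B"},
    {"required": {"audio", "video"}, "available": {"video"}, "prediction": "C", "answer": "C"},
    {"required": {"subtitle"}, "available": {"subtitle"}, "prediction": "D", "answer": "A"},
]
toy_scores = accuracy_by_condition(toy_records)
```

Reporting accuracy per condition, rather than a single pooled number, keeps failures caused by missing or extra signals separate from errors made when the model sees exactly the modalities it needs.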

Provider:

lihVerma
