lihVerma/MAPLE-bench
Source: Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/lihVerma/MAPLE-bench
Description:
---
language: en
license: cc-by-nc-nd-4.0
task_categories:
- visual-question-answering
tags:
- multimodal
- benchmark
- QA
- captioning
- video-audio-subtitle
- modality-aware
- reinforcement-learning
- grpo
size_categories:
- 1K<n<10K
- 10K<n<100K
---
# MAPLE Benchmark Test Splits
This repository contains the **test splits** for the MAPLE benchmark introduced in the paper *MAPLE: Modality-Aware Post-training and Learning Ecosystem* (https://arxiv.org/pdf/2602.11596). The benchmark is designed for modality-aware multimodal evaluation under different required-signal settings, where each sample is annotated with the minimal modality subset needed to solve the task.
## Dataset Overview
MAPLE-bench evaluates multimodal reasoning across **video, audio, and subtitles** with explicit modality requirements. The benchmark supports two tasks:
- **MAPLE-QA**: multiple-choice question answering with verifiable answers.
- **MAPLE-Caption**: open-ended caption generation.
The benchmark is built to study performance under modality-exact, modality-superset, and modality-deficit conditions, and to distinguish true reasoning ability from failures caused by missing or extra signals.
## Shared Test Splits Included
This dataset card provides the manually verified test splits shared with the paper:
- `MAPLE_caption` — **5,348** test samples.
- `MAPLE_QA_exact` — **5,001** test samples.
- `MAPLE_QA_all_combinations` — **34,048** test samples for the extended QA ablation dataset.
All data splits were manually verified and are, to the best of our knowledge, faithful to the paper’s annotations.
## Data Format
### MAPLE_caption
Captioning samples are modality-tagged examples intended for open-ended generation. Each item corresponds to a specific modality condition and contains the metadata needed to evaluate caption quality under the available inputs. The video IDs refer to the [VAST-omni dataset](https://proceedings.neurips.cc/paper_files/paper/2023/file/e6b2b48b5ed90d07c305932729927781-Paper-Conference.pdf).
### MAPLE_QA_exact
This split contains the standard QA benchmark test set. Each item is a multiple-choice question with four options and a single gold answer. The answer is designed to be verifiable under the required modality subset. The video IDs refer to the [Daily-omni dataset](https://arxiv.org/abs/2505.17862).
### MAPLE_QA_all_combinations
This extended QA split expands the benchmark into modality-exact, modality-superset, and modality-deficit settings. It is intended for ablation and robustness evaluation across all combinations of available modalities. The video IDs refer to the [Daily-omni dataset](https://arxiv.org/abs/2505.17862).
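The three settings can be illustrated with a small set comparison between the modalities a sample requires and the modalities a model is given. The following is a minimal sketch, not the paper's implementation: the function names, the modality labels, and the rule that any missing required modality counts as a deficit (even when extra modalities are present) are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical modality labels; the actual tag strings in the dataset may differ.
MODALITIES = ("video", "audio", "subtitle")


def modality_condition(required, available):
    """Classify an input condition relative to the required modality subset.

    Assumed rule (not confirmed by the dataset card):
      - "exact":    available modalities equal the required ones
      - "superset": all required modalities are present, plus extras
      - "deficit":  at least one required modality is missing
    """
    required, available = frozenset(required), frozenset(available)
    if available == required:
        return "exact"
    if required < available:  # strict subset: everything required, plus more
        return "superset"
    return "deficit"


def all_modality_subsets(modalities=MODALITIES):
    """Enumerate every non-empty combination of the given modalities."""
    return [
        frozenset(combo)
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]
```

For example, a sample that requires only audio would be classified as `superset` when the model also receives video, and as `deficit` when it receives video alone. Whether the released split also includes an empty-input condition is not stated here, so the enumeration above covers non-empty subsets only.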
## Benchmark Purpose
MAPLE is intended to measure whether multimodal models can:
- learn which modalities are necessary for a task,
- avoid over-relying on irrelevant modalities,
- remain accurate under partial-signal conditions,
- abstain when information is insufficient.
The paper motivates this through a modality-aware post-training framework, but this repository only releases the benchmark test data for evaluation.
## Usage Notes
- Use the test splits only for evaluation.
- For QA, the expected output is a single multiple-choice option.
- For captioning, predictions should be compared against the reference caption for the corresponding modality tag.
- The prompts used to create the dataset and to evaluate performance are given in the paper's appendix.
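As one way to apply the QA note above, the sketch below scores predictions by strict option matching and reports accuracy per group. It is an illustrative assumption rather than the paper's evaluation script: the record fields `prediction`, `answer`, and `condition`, the option letters A to D, and the choice to count any output that is not a single option as incorrect are all hypothetical.

```python
import re
from collections import defaultdict

# Accepts outputs such as "B", "b", "(C)", or "D." and nothing else.
_OPTION_RE = re.compile(r"\s*\(?([A-Da-d])\)?[.:]?\s*")


def extract_option(text):
    """Return the option letter if the output is a single option, else None."""
    match = _OPTION_RE.fullmatch(text)
    return match.group(1).upper() if match else None


def accuracy_by_group(records, group_key="condition"):
    """Compute accuracy per group over an iterable of dict records.

    Each record is assumed to carry:
      - "prediction": raw model output (str)
      - "answer":     gold option letter, e.g. "A"
      - group_key:    a label to group by, e.g. the modality condition
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if extract_option(record["prediction"]) == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}
```

Strict matching keeps the metric verifiable but penalizes outputs that wrap the option in extra text; a more lenient parser may be preferable if the evaluated model is not instructed to answer with the option alone.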
## Citation
If you use this benchmark, please cite the associated paper:
```bibtex
@article{maple2026,
title={MAPLE: Modality-Aware Post-training and Learning Ecosystem},
author={Verma, Nikhil and Kim, Minjung and Yoo, JooYoung and Jin, Kyung-Min and Bharadwaj, Manasa and Ferreira, Kevin and Kim, Ko Keun and Kim, Youngjoon},
journal={arXiv preprint arXiv:2602.11596},
year={2026},
url={https://arxiv.org/pdf/2602.11596}
}
```
Provider:
lihVerma



