RUC-NLPIR/OmniGAIA
Source: Hugging Face. Updated 2026-03-07; indexed 2026-04-05.
Download link: https://hf-mirror.com/datasets/RUC-NLPIR/OmniGAIA
Description:
---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
language:
- en
pretty_name: OmniGAIA
size_categories:
- n<1K
tags:
- multimodal
- benchmark
- agent
- tool-use
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*.parquet
---
<h1 style="text-align: left; font-size: 1.6em; margin-bottom: 0.75em;">
<span style="color:#1628a7; font-weight:bold;">O</span><span style="color:#402b94; font-weight:bold;">m</span><span style="color:#673ea0; font-weight:bold;">n</span><span style="color:#8b16aa; font-weight:bold;">i</span>GAIA: Omni-Modal General AI Assistant Benchmark
</h1>
<div style="text-align: left; margin-bottom: 18px;">
<a href="https://arxiv.org/abs/2602.22897" target="_blank">📄 Paper</a>
•
<a href="https://github.com/RUC-NLPIR/OmniGAIA" target="_blank">💻 Code & Demo</a>
•
<a href="https://huggingface.co/collections/RUC-NLPIR/omnigaia" target="_blank">🤗 Dataset & Model</a>
•
<a href="https://huggingface.co/spaces/RUC-NLPIR/OmniGAIA-Leaderboard" target="_blank">📈 Leaderboard</a>
</div>
OmniGAIA is a benchmark for <span style="text-decoration: underline;">Omni</span>-Modal <span style="text-decoration: underline;">G</span>eneral <span style="text-decoration: underline;">AI</span> <span style="text-decoration: underline;">A</span>ssistants that jointly reason over vision, audio, and language with external tools. It is designed to evaluate long-horizon, multi-hop, open-form problem solving in realistic settings rather than short perception-only QA.
## Benchmark Construction
<div align="left">
<img src="./assets/omnigaia_construction.png" width="95%" />
</div>
The OmniGAIA construction pipeline consists of four stages:
1. **Data Collection** — Curating video (with audio) and image+audio sources from FineVideo, LongVideoBench, LongVideo-Reason, COCO 2017, and HuggingFace, covering 100+ diverse domains.
2. **Valuable Information Discovery** — Using Gemini-3-Flash to extract events, environmental analysis, audio analysis (ASR, speaker ID), and image understanding (OCR, objects, faces).
3. **Agentic Omni-Modal Event Graph Construction** — DeepSeek-V3.2 iteratively expands an initial event graph by planning next steps, acquiring new information via tools, and verifying factual correctness through LLM self-reflection and human review.
4. **QA Generation & Quality Review** — Generating difficult, multi-hop QA pairs through event fuzzification, followed by LLM and human verification of correctness, task difficulty, and answer uniqueness.
## Benchmark Statistics
<div align="left">
<img src="./assets/omnigaia_statistics.png" width="95%" />
</div>
Key numbers:
- **360** QA pairs across **9** domains (Geography, History, Technology, Sports, Arts, Movies, Science, Finance, Food)
- **3** difficulty levels — Easy (33.9%), Medium (44.4%), Hard (21.7%)
- **Median video duration:** 242.2s | **Median audio duration:** 197.0s
- **99.7%** of tasks require visual perception; **99.7%** require audio perception
- **98.6%** require web search; **74.4%** require code / computation
## Task Examples
<div align="left">
<img src="./assets/omnigaia_examples.png" width="95%" />
</div>
## Data Format
Each row is one benchmark task.
| Field | Type | Description |
|---|---|---|
| `id` | int | Task identifier |
| `question` | string | User question |
| `image_1`,`image_2`,`image_3` | Image / null | Image inputs (if any) |
| `audio_1`,`audio_2`,`audio_3` | string / null | Audio file URLs in this dataset repo |
| `video_1`,`video_2`,`video_3` | string / null | Video file URLs in this dataset repo |
| `annotated_solution` | list[string] | Step-by-step reference reasoning |
| `sources_json` | string | JSON-encoded evidence sources |
| `omni_modal_input_json` | string | JSON-encoded original multimodal metadata |
| `answer` | string | Ground-truth answer |
| `level` | string | Difficulty (`Easy`, `Medium`, or `Hard`) |
| `total_steps` | int | Number of reference reasoning steps |
| `task_type` | string | Task setting/type |
| `category` | string | Domain category |
| `required_external_tools` | list[string] | Tools required by annotation |
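As a rough illustration of how a row in this schema might be consumed, the sketch below decodes the JSON-encoded `sources_json` column and lists which media slots are filled. The helper `summarize_task` and the example row are hypothetical; the internal structure of `sources_json` is not documented here, so the list of objects with a `url` key is an assumption. Real rows would typically come from `datasets.load_dataset("RUC-NLPIR/OmniGAIA", split="test")`.

```python
import json
from typing import Any, Dict, List

# Media columns named in the field table: image_1..3, audio_1..3, video_1..3.
MEDIA_FIELDS: List[str] = [
    f"{kind}_{i}" for kind in ("image", "audio", "video") for i in (1, 2, 3)
]


def summarize_task(row: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the JSON string column and list the media slots that are not null."""
    return {
        "id": row["id"],
        "level": row["level"],
        "media": [name for name in MEDIA_FIELDS if row.get(name) is not None],
        "sources": json.loads(row["sources_json"]),
        "num_steps": len(row["annotated_solution"]),
    }


# Hypothetical row for illustration only; the values are NOT taken from the dataset.
example_row = {
    "id": 0,
    "question": "Which city is shown in the clip?",
    "image_1": None, "image_2": None, "image_3": None,
    "audio_1": "audio/0.wav", "audio_2": None, "audio_3": None,
    "video_1": "video/0.mp4", "video_2": None, "video_3": None,
    "annotated_solution": ["Identify the landmark.", "Look up its city."],
    "sources_json": '[{"url": "https://example.com"}]',  # assumed structure
    "answer": "Paris",
    "level": "Easy",
}

summary = summarize_task(example_row)
print(summary["media"], summary["num_steps"])
```

Keeping the JSON columns as strings until they are needed avoids committing to a fixed nested schema, which is presumably why they are stored encoded.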
## Evaluation
The leaderboard reports **Pass@1 Accuracy (%)** on the official test split.
Task correctness follows a two-stage protocol:
1. **Exact Match (EM):** extract the text between `<answer>` and `</answer>` in the model output and compare it with the ground-truth label.
2. **LLM-as-a-Judge fallback:** if EM fails, an LLM judges whether the extracted answer is semantically equivalent to the label (DeepSeek-V3.2 in the paper and leaderboard pipeline).
All compared models are evaluated under the same tool setting (web search, browser, code executor).
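The two-stage protocol above can be sketched as follows. This is a minimal illustration, not the official scoring script: the function names are hypothetical, the normalization (whitespace collapsing and case folding), the choice of the last `<answer>` span, and treating a missing tag as incorrect are all assumptions, and the judge is passed in as a plain callable standing in for the LLM call.

```python
import re
from typing import Callable, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> span, or None if absent."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    # Assumption: whitespace and case are ignored; the official script may differ.
    return " ".join(text.split()).casefold()


def is_correct(output: str, label: str, judge: Callable[[str, str], bool]) -> bool:
    """Stage 1: exact match on the extracted answer. Stage 2: judge fallback."""
    predicted = extract_answer(output)
    if predicted is None:
        return False  # assumption: output without answer tags counts as wrong
    if normalize(predicted) == normalize(label):
        return True
    return judge(predicted, label)


def toy_judge(predicted: str, label: str) -> bool:
    # Stand-in for the LLM judge: accepts answers that contain the label.
    return label.casefold() in predicted.casefold()


print(is_correct("Reasoning... <answer> Paris </answer>", "paris", toy_judge))
print(is_correct("<answer>the city of Paris</answer>", "Paris", toy_judge))
print(is_correct("Paris", "Paris", toy_judge))
```

Running exact match first keeps judge calls limited to the cases where the surface forms differ.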
## Leaderboard
| Rank | Model | Overall | Easy | Med | Hard |
|---:|---|---:|---:|---:|---:|
| 1 | Gemini-3-Pro | 62.5 | 78.7 | 61.9 | 38.5 |
| 2 | Gemini-3-Flash | 51.7 | 67.2 | 46.9 | 37.2 |
| 3 | Gemini-2.5-Pro | 30.8 | 41.8 | 26.9 | 21.8 |
| 4 | OmniAtlas-Qwen3-30B | 20.8 | 31.1 | 18.8 | 9.0 |
| 5 | Qwen3-Omni-30B | 13.3 | 19.7 | 10.6 | 9.0 |
| 6 | OmniAtlas-Qwen2.5-7B | 13.3 | 22.1 | 11.3 | 3.9 |
| 7 | LongCat-Flash-Omni-560B | 11.1 | 16.4 | 9.4 | 6.4 |
| 8 | OmniAtlas-Qwen2.5-3B | 10.3 | 13.9 | 10.0 | 5.1 |
| 9 | Gemini-2.5-Flash-Lite | 8.6 | 9.8 | 8.1 | 7.7 |
| 10 | Ming-Flash-Omni-100B | 8.3 | 12.3 | 7.5 | 3.8 |
| 11 | Ming-Lite-Omni-1.5-20B | 3.9 | 4.9 | 3.8 | 2.6 |
| 12 | Qwen2.5-Omni-7B | 3.6 | 8.2 | 1.3 | 1.3 |
| 13 | MiniCPM-O-2.6-8B | 3.1 | 3.3 | 2.5 | 3.8 |
| 14 | Baichuan-Omni-1.5-8B | 2.8 | 4.9 | 2.5 | 0.0 |
| 15 | Qwen2.5-Omni-3B | 1.4 | 1.6 | 1.9 | 0.0 |
Official leaderboard space: https://huggingface.co/spaces/RUC-NLPIR/OmniGAIA-Leaderboard
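As a consistency check on the table, the Overall column can be reproduced from the per-level columns if one assumes it is the accuracy over all 360 tasks, with per-level task counts obtained by rounding the difficulty shares from the statistics section (33.9%, 44.4%, 21.7%). Both assumptions, and the names used below, are illustrative rather than taken from the official pipeline.

```python
from typing import Dict

TOTAL_TASKS = 360
LEVEL_SHARE = {"Easy": 0.339, "Medium": 0.444, "Hard": 0.217}

# Assumed task counts per level, derived by rounding share * total.
LEVEL_COUNT = {level: round(TOTAL_TASKS * share) for level, share in LEVEL_SHARE.items()}


def overall_accuracy(per_level: Dict[str, float]) -> float:
    """Task-weighted average of per-level accuracies, in percent."""
    solved = sum(per_level[level] / 100 * LEVEL_COUNT[level] for level in LEVEL_COUNT)
    return 100 * solved / TOTAL_TASKS


# Per-level scores for Gemini-3-Pro, copied from the leaderboard table.
gemini_3_pro = {"Easy": 78.7, "Medium": 61.9, "Hard": 38.5}

print(LEVEL_COUNT)
print(round(overall_accuracy(gemini_3_pro), 1))
```

Under these assumptions the level counts come out to 122, 160, and 78 tasks, and the weighted result for Gemini-3-Pro matches the reported 62.5.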
## Citation
If you find OmniGAIA useful in your work, we kindly ask that you cite us:
```bibtex
@misc{li2026omnigaia,
title={OmniGAIA: Towards Native Omni-Modal AI Agents},
author={Xiaoxi Li and Wenxiang Jiao and Jiarui Jin and Shijian Wang and Guanting Dong and Jiajie Jin and Hao Wang and Yinuo Wang and Ji-Rong Wen and Yuan Lu and Zhicheng Dou},
year={2026},
eprint={2602.22897},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.22897},
}
```
Provider: RUC-NLPIR



