
RUC-NLPIR/OmniGAIA

Hugging Face | Updated 2026-03-07 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/RUC-NLPIR/OmniGAIA
Resource description:

---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
language:
- en
pretty_name: OmniGAIA
size_categories:
- n<1K
tags:
- multimodal
- benchmark
- agent
- tool-use
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*.parquet
---

<h1 style="text-align: left; font-size: 1.6em; margin-bottom: 0.75em;">
  <span style="color:#1628a7; font-weight:bold;">O</span><span style="color:#402b94; font-weight:bold;">m</span><span style="color:#673ea0; font-weight:bold;">n</span><span style="color:#8b16aa; font-weight:bold;">i</span>GAIA: Omni-Modal General AI Assistant Benchmark
</h1>

<div style="text-align: left; margin-bottom: 18px;">
  <a href="https://arxiv.org/abs/2602.22897" target="_blank">📄 Paper</a> &nbsp; • &nbsp;
  <a href="https://github.com/RUC-NLPIR/OmniGAIA" target="_blank">💻 Code & Demo</a> &nbsp; • &nbsp;
  <a href="https://huggingface.co/collections/RUC-NLPIR/omnigaia" target="_blank">🤗 Dataset & Model</a> &nbsp; • &nbsp;
  <a href="https://huggingface.co/spaces/RUC-NLPIR/OmniGAIA-Leaderboard" target="_blank">📈 Leaderboard</a>
</div>

OmniGAIA is a benchmark for <span style="text-decoration: underline;">Omni</span>-Modal <span style="text-decoration: underline;">G</span>eneral <span style="text-decoration: underline;">AI</span> <span style="text-decoration: underline;">A</span>ssistants that jointly reason over vision, audio, and language with external tools. It is designed to evaluate long-horizon, multi-hop, open-form problem solving in realistic settings rather than short perception-only QA.

## Benchmark Construction

<div align="left">
  <img src="./assets/omnigaia_construction.png" width="95%" />
</div>

The OmniGAIA construction pipeline consists of four stages:

1. **Data Collection**: Curating video (with audio) and image+audio sources from FineVideo, LongVideoBench, LongVideo-Reason, COCO 2017, and Hugging Face, covering 100+ diverse domains.
2. **Valuable Information Discovery**: Using Gemini-3-Flash to extract events, environmental analysis, audio analysis (ASR, speaker ID), and image understanding (OCR, objects, faces).
3. **Agentic Omni-Modal Event Graph Construction**: DeepSeek-V3.2 iteratively expands an initial event graph by planning next steps, acquiring new information via tools, and verifying factual correctness with LLM self-reflection and human review.
4. **QA Generation & Quality Review**: Generating difficult, multi-hop QA pairs through event fuzzification, followed by LLM and human verification for correctness, task difficulty, and answer uniqueness.

## Benchmark Statistics

<div align="left">
  <img src="./assets/omnigaia_statistics.png" width="95%" />
</div>

Key numbers:

- **360** QA pairs across **9** domains (Geography, History, Technology, Sports, Arts, Movies, Science, Finance, Food)
- **3** difficulty levels: Easy (33.9%), Medium (44.4%), Hard (21.7%)
- **Median video duration:** 242.2s | **Median audio duration:** 197.0s
- **99.7%** of tasks require visual perception; **99.7%** require audio perception
- **98.6%** require web search; **74.4%** require code / computation

## Task Examples

<div align="left">
  <img src="./assets/omnigaia_examples.png" width="95%" />
</div>

## Data Format

Each row is one benchmark task.

| Field | Type | Description |
|---|---|---|
| `id` | int | Task identifier |
| `question` | string | User question |
| `image_1`, `image_2`, `image_3` | Image / null | Image inputs (if any) |
| `audio_1`, `audio_2`, `audio_3` | string / null | Audio file URLs in this dataset repo |
| `video_1`, `video_2`, `video_3` | string / null | Video file URLs in this dataset repo |
| `annotated_solution` | list[string] | Step-by-step reference reasoning |
| `sources_json` | string | JSON-encoded evidence sources |
| `omni_modal_input_json` | string | JSON-encoded original multimodal metadata |
| `answer` | string | Ground-truth answer |
| `level` | string | Difficulty (`Easy/Medium/Hard`) |
| `total_steps` | int | Number of reference reasoning steps |
| `task_type` | string | Task setting/type |
| `category` | string | Domain category |
| `required_external_tools` | list[string] | Tools required by annotation |

## Evaluation

The leaderboard reports **Pass@1 Accuracy (%)** on the official test split. Task correctness follows a two-stage protocol:

1. **Exact Match (EM):** extract text between `<answer>` and `</answer>` in model output and compare with label.
2. **LLM-as-a-Judge fallback:** if EM fails, judge semantic equivalence (DeepSeek-V3.2 in the paper/leaderboard pipeline).

All compared models are evaluated under the same tool setting (web search, browser, code executor).
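
The two-stage protocol above can be sketched in Python as follows. This is a minimal illustration, not the official scorer. The function names, the normalization (lowercasing and whitespace collapsing), the choice to use the last `<answer>` pair when several appear, the rule that an output without answer tags counts as wrong, and the `judge` callback that stands in for the DeepSeek-V3.2 judge are all assumptions made for this example.

```python
import re
from typing import Callable, Optional, Sequence

# Matches the text between <answer> and </answer>, across newlines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Hypothetical judge signature: (predicted, label) -> True if semantically equivalent.
Judge = Callable[[str, str], bool]


def extract_answer(model_output: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(model_output)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_correct(model_output: str, label: str, judge: Optional[Judge] = None) -> bool:
    """Stage 1: exact match. Stage 2: fall back to the judge if one is given."""
    predicted = extract_answer(model_output)
    if predicted is None:
        # Assumption: no extractable answer counts as incorrect.
        return False
    if normalize(predicted) == normalize(label):
        return True
    if judge is not None:
        return bool(judge(predicted, label))
    return False


def pass_at_1(outputs: Sequence[str], labels: Sequence[str],
              judge: Optional[Judge] = None) -> float:
    """Pass@1 accuracy in percent over one model output per task."""
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(is_correct(o, l, judge) for o, l in zip(outputs, labels))
    return 100.0 * correct / len(labels)
```

In practice the `judge` argument would wrap a call to whichever LLM serves as the judge. Keeping it as a plain callable makes the exact-match stage testable without network access.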

## Leaderboard

| Rank | Model | Overall | Easy | Medium | Hard |
|---:|---|---:|---:|---:|---:|
| 1 | Gemini-3-Pro | 62.5 | 78.7 | 61.9 | 38.5 |
| 2 | Gemini-3-Flash | 51.7 | 67.2 | 46.9 | 37.2 |
| 3 | Gemini-2.5-Pro | 30.8 | 41.8 | 26.9 | 21.8 |
| 4 | OmniAtlas-Qwen3-30B | 20.8 | 31.1 | 18.8 | 9.0 |
| 5 | Qwen3-Omni-30B | 13.3 | 19.7 | 10.6 | 9.0 |
| 6 | OmniAtlas-Qwen2.5-7B | 13.3 | 22.1 | 11.3 | 3.9 |
| 7 | LongCat-Flash-Omni-560B | 11.1 | 16.4 | 9.4 | 6.4 |
| 8 | OmniAtlas-Qwen2.5-3B | 10.3 | 13.9 | 10.0 | 5.1 |
| 9 | Gemini-2.5-Flash-Lite | 8.6 | 9.8 | 8.1 | 7.7 |
| 10 | Ming-Flash-Omni-100B | 8.3 | 12.3 | 7.5 | 3.8 |
| 11 | Ming-Lite-Omni-1.5-20B | 3.9 | 4.9 | 3.8 | 2.6 |
| 12 | Qwen2.5-Omni-7B | 3.6 | 8.2 | 1.3 | 1.3 |
| 13 | MiniCPM-O-2.6-8B | 3.1 | 3.3 | 2.5 | 3.8 |
| 14 | Baichuan-Omni-1.5-8B | 2.8 | 4.9 | 2.5 | 0.0 |
| 15 | Qwen2.5-Omni-3B | 1.4 | 1.6 | 1.9 | 0.0 |

Official leaderboard space: https://huggingface.co/spaces/RUC-NLPIR/OmniGAIA-Leaderboard

## Citation

If you find OmniGAIA useful in your work, we kindly ask that you cite us:

```bibtex
@misc{li2026omnigaia,
  title={OmniGAIA: Towards Native Omni-Modal AI Agents},
  author={Xiaoxi Li and Wenxiang Jiao and Jiarui Jin and Shijian Wang and Guanting Dong and Jiajie Jin and Hao Wang and Yinuo Wang and Ji-Rong Wen and Yuan Lu and Zhicheng Dou},
  year={2026},
  eprint={2602.22897},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.22897},
}
```
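
As a sanity check on the leaderboard, the Overall column can be related to the per-level columns through the difficulty split reported under Benchmark Statistics. The sketch below assumes, without confirmation from the card, that Overall is the task-count-weighted mean of the Easy, Medium, and Hard accuracies and that per-level task counts can be recovered by rounding the reported percentages of 360 tasks. The helper names are hypothetical.

```python
# Figures taken from the card: 360 tasks, split by difficulty share (%).
TOTAL_TASKS = 360
LEVEL_SHARES = {"Easy": 33.9, "Medium": 44.4, "Hard": 21.7}


def level_counts(total: int, shares: dict) -> dict:
    """Recover per-level task counts by rounding (an assumption, not stated in the card)."""
    return {level: round(total * pct / 100) for level, pct in shares.items()}


def overall_accuracy(per_level_acc: dict, counts: dict) -> float:
    """Task-count-weighted mean of per-level accuracies, in percent."""
    total = sum(counts.values())
    return sum(per_level_acc[level] * counts[level] for level in counts) / total


counts = level_counts(TOTAL_TASKS, LEVEL_SHARES)
print(counts)  # {'Easy': 122, 'Medium': 160, 'Hard': 78}

# Per-level accuracies for the top leaderboard row (Gemini-3-Pro).
gemini_3_pro = {"Easy": 78.7, "Medium": 61.9, "Hard": 38.5}
print(round(overall_accuracy(gemini_3_pro, counts), 1))  # 62.5
```

Under these assumptions the computed value matches the reported Overall score of 62.5 for Gemini-3-Pro, which suggests the Overall column is computed over all tasks and is not an unweighted mean of the three levels.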
Provider:
RUC-NLPIR