UNO-Bench

Name: UNO-Bench
Creator: maas
Published: 2026-05-02 07:11:16
License: No description available

ModelScope Community. Updated 2026-05-02. Indexed 2025-11-08.

Download link:

https://modelscope.cn/datasets/meituan-longcat/UNO-Bench

Resource description:

<h1> UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models</h1>

<p align="center" width="100%">
<img src="assets/uno-bench-title.jpeg" width="80%" height="100%">
</p>

<div align="center" style="line-height: 1;">
<a target="_blank" href='https://meituan-longcat.github.io/UNO-Bench'><img src='https://img.shields.io/badge/Project-Page-green'></a>
<a target="_blank" href='https://agi-eval.cn/evaluation/detail?id=139'><img src='https://img.shields.io/badge/leaderboard-page-orange'></a>
<a target="_blank" href='https://arxiv.org/abs/2510.18915'><img src='https://img.shields.io/badge/Technique-Report-red'></a>
<a target="_blank" href='https://huggingface.co/datasets/meituan-longcat/UNO-Bench'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-blue'></a>
<a href='./'><img src='https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53'></a>
</div>

## 👀 UNO-Bench Overview

Multimodal large language models have been progressing from uni-modal understanding toward unifying the visual, audio, and language modalities; such models are collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is needed to drive the development of omni models. In this work, we introduce a novel, high-quality, **UN**ified **O**mni model benchmark, **UNO-Bench**. The benchmark is designed to evaluate both **UN**i-modal and **O**mni-modal capabilities under a unified ability taxonomy spanning 44 task types and 5 modality combinations. It includes 1250 human-curated omni-modal samples with 98% cross-modality solvability and 2480 enhanced uni-modal samples. The human-generated dataset is well suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks.
In addition to traditional multiple-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental results reveal the **Compositional Law** between omni-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.

<div>
<p align="center">
<img src="./assets/omni-ability.png" width="80%" height="100%" />
</p>
</div>

<div>
<p align="center">
<img src="./assets/data-statistics.png" width="80%" height="100%" />
</p>
</div>

**Main Contributions**

- 🌟 **Propose UNO-Bench, the first unified omni model benchmark**, which efficiently assesses uni-modal and omni-modal understanding. It verifies the compositional law between these capabilities, under which omni-modal performance is bottlenecked in weaker models and enhanced in stronger ones.
- 🌟 **Establish a high-quality dataset pipeline** with human-centric processes and automated compression. UNO-Bench contains 1250 omni-modal samples with 98% cross-modality solvability and 2480 uni-modal samples across 44 task types and 5 modality combinations. The dataset is well suited to real-world scenarios, especially in the Chinese context, and the compressed version offers a 90% speed increase while maintaining 98% ranking consistency across 18 benchmarks.
- 🌟 **Introduce Multi-Step Open-Ended Questions (MO)** for complex reasoning evaluation, yielding more realistic results. A General Scoring Model supports 6 question types with 95% accuracy on out-of-distribution (OOD) models and benchmarks.

## 📊 Dataset Construction

**Material Collection**

Our materials feature three key characteristics:

- **a. Diverse Sources**: primarily real-world photos and videos from crowdsourcing, supplemented by copyright-free websites and high-quality public datasets.
- **b. Rich and Diverse Topics**: spanning society, culture, art, life, literature, and science.
- **c. Live-Recorded Audio**: dialogue recorded by over 20 human speakers, ensuring rich audio features that mirror real-world vocal diversity.

**QA Annotation**

Our annotators include human experts and skilled crowd-sourced users. The human experts bring extensive experience in cross-modal data and model understanding, which keeps the data professional and specific. The crowd-sourced users, mainly college students, contribute authentic and diverse data thanks to their experience with multi-modal models and their varied backgrounds.

**Quality Inspection**

To ensure data quality, we use a multi-stage quality assurance system that combines automated tools with manual review. Each question undergoes three independent inspections: a preliminary model check filters out ambiguous or non-conforming questions; modality ablation experiments test cross-modality solvability by removing one modality; and a final manual inspection and revision ensures accuracy.

**Data Compression**

For automated data compression, we propose a cluster-guided stratified sampling method that compresses 18 public benchmarks by 90% while maintaining 98% ranking consistency.

<div>
<p align="center">
<img src="./assets/omni-data-pipeline.png" width="80%" height="100%" />
</p>
</div>

## 📍 Dataset Examples

The capabilities of UNO-Bench are systematically categorized into two primary dimensions: Perception and Reasoning. UNO-Bench can be downloaded from this [link](https://huggingface.co/datasets/meituan-longcat/UNO-Bench). Some examples from UNO-Bench are shown below:

<p align="center">
<img alt="image2" src="./assets/omni-perception-cases.png" />
</p>

---

<p align="center">
<img alt="image2" src="./assets/omni-reasoning-cases.png" />
</p>

For more samples, please refer to the project [page](https://meituan-longcat.github.io/UNO-Bench).

## 🔍 Results

Our main evaluation reveals a clear performance hierarchy in which proprietary models, particularly Gemini-2.5-Pro, establish the state of the art across all benchmarks.
<p align="center">
<img src="./assets/cross-modal-results.png" width="60%" height="100%" />
</p>

**Finding 1. 📍Perception Ability and Reasoning Ability:** Compared to human experts, Gemini-2.5-Pro performs similarly in perception but falls significantly behind in reasoning. Humans, meanwhile, are more proficient at reasoning than at perception (81.3% versus 74.3%).

<p align="center">
<img src="./assets/gemini-2.5-vs-human.png" width="60%" height="100%" />
</p>

**Finding 2. 📍Compositional Law: Omni-modal capability correlates with the product of the individual modality performances, following a power law.** Based on the fundamental premise that nearly 100% of the questions in UNO-Bench require a joint understanding of audio and visual information, we combine experimental observations with rigorous mathematical derivation to propose the following formula for the compositional law.

$$
P_{\text{Omni}} = C \cdot (P_{\text{Audio}} \times P_{\text{Visual}})^{\alpha} + b
$$

This model fits our data almost perfectly, achieving a coefficient of determination ($R^2$) of $0.9759$.

- $\alpha=2.19$ is the synergistic exponent; being greater than 1, it explains the transition from a "short-board" (bottleneck) effect to an "emergent ability".
- $b=0.24$ is the baseline bias, close to 0.25, reflecting the random-guess accuracy of our benchmark.
- $C=1.03$ is the scaling coefficient, close to 1, indicating a harmonious and naturally scaled system.

<p align="center">
<img src="./assets/compositional-law.png" width="60%" height="100%" />
</p>

**Finding 3. 📍Redundant Synchronized Audio-visual Video Data:** Audio-visual synchronized video data is highly redundant, which makes it challenging to design questions that test understanding of both the audio and the visual content. Consequently, using standard videos for training or evaluation makes it difficult to develop models with effective modal collaboration capabilities.
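
The compositional law in Finding 2 can be evaluated directly from the three fitted constants. The sketch below is only an illustration: the function name `predict_omni` is hypothetical, the input scores are made up, and it assumes that uni-modal performance is expressed as an accuracy fraction in [0, 1], which the text above does not state explicitly.

```python
# Illustrative sketch: evaluates the compositional-law formula using the
# fitted constants quoted in Finding 2. All names here are hypothetical.

C = 1.03      # scaling coefficient (from the text)
ALPHA = 2.19  # synergistic exponent (from the text)
B = 0.24      # baseline bias (from the text)


def predict_omni(p_audio: float, p_visual: float) -> float:
    """Return the omni-modal score predicted by the compositional law.

    Assumption: both inputs are accuracies expressed as fractions in [0, 1].
    """
    if not (0.0 <= p_audio <= 1.0 and 0.0 <= p_visual <= 1.0):
        raise ValueError("scores must be fractions in [0, 1]")
    return C * (p_audio * p_visual) ** ALPHA + B


if __name__ == "__main__":
    # Made-up uni-modal scores, chosen only to show the shape of the curve.
    for p in (0.3, 0.5, 0.7, 0.9):
        print(f"audio=visual={p:.1f} -> predicted omni={predict_omni(p, p):.3f}")
```

Under these assumptions, uni-modal scores of 0.5 each give a prediction of roughly 0.29, barely above the baseline $b$, whereas scores of 0.8 each give roughly 0.63. Because the exponent is greater than 1, the predicted curve stays flat for weak models and steepens for strong ones. These numbers are outputs of the formula, not reported results.
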
For samples, please visit the project [page](https://meituan-longcat.github.io/UNO-Bench).

## 📌 Checklist

- **Data**
  - ✅ Benchmark Leaderboard
  - ✅ UNO-Bench Dataset
- **Code**
  - □ Evaluation Toolkit
  - □ Model Weights and Configurations

## 🖊️ Citation

If you find our work helpful for your research, please consider citing it.

```bibtex
@misc{chen2025unobench,
  title={UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models},
  author={Chen Chen and ZeYang Hu and Fengjiao Chen and Liya Ma and Jiaxing Liu and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
  year={2025},
  eprint={2510.18915},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.18915},
}
```

## 🔮 Data Statements

The majority of our materials are real-world photos and videos collected through crowdsourcing, while a small fraction comes from high-quality public datasets such as [MMVU](https://arxiv.org/abs/2501.12380), [LongVideoBench](https://arxiv.org/abs/2407.15754), [VideoVista](https://arxiv.org/abs/2504.17821), [WorldSense](https://arxiv.org/abs/2502.04326) and [AV-Odyssey](https://arxiv.org/abs/2412.02611).
Additionally, we employ 18 publicly available benchmarks to build the compressed visual and audio datasets, including [RealWorldQA](https://huggingface.co/datasets/xai-org/RealworldQA), [MME](https://arxiv.org/abs/2306.13394), [SeedBench](https://arxiv.org/abs/2307.16125), [OCRBench](https://arxiv.org/abs/2305.07895), [Fox](https://arxiv.org/abs/2405.14295), [DocLocal4k](https://arxiv.org/abs/2307.02499), [MMMU](https://arxiv.org/abs/2311.16502), [MMMU-Pro](https://arxiv.org/abs/2409.02813), [CMMMU](https://arxiv.org/abs/2401.11944), [MathVista](https://arxiv.org/abs/2310.02255), [MathVision](https://arxiv.org/abs/2402.14804), [ScienceVista](https://arxiv.org/abs/2501.12599), [GMAI-MMBench](https://arxiv.org/abs/2408.03361), [ReMi](https://arxiv.org/abs/2406.09175), [MuirBench](https://arxiv.org/abs/2406.09411), [MMAU](https://arxiv.org/abs/2410.19168), [MMSU](https://arxiv.org/abs/2506.04779) and [SDQA](https://arxiv.org/abs/2109.12072).

## 📐 Acknowledgments

We thank the LongCat Team EVA Committee for their valuable assistance, guidance, and suggestions throughout the course of this work.


Provider: maas

Created: 2025-11-03
