Logics-MLLM/OmniParsingBench
Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/Logics-MLLM/OmniParsingBench
Dataset description:
---
license: apache-2.0
language:
- zh
- en
pretty_name: OmniParsingBench
tags:
- parsing
- chart
- document
- multimodal
configs:
- config_name: default
  data_files:
  - split: natural_image
    path: data/natural_image.jsonl
  - split: graphics
    path: data/graphics.jsonl
  - split: audio
    path: data/audio.jsonl
  - split: natural_video
    path: data/natural_video.jsonl
  - split: textrich_video
    path: data/textrich_video.jsonl
---
<div align="center">
<img src="logo.png" width="80%">
</div>
<p align="center">
🤗 <a href="https://huggingface.co/Logics-MLLM/Logics-Parsing-Omni">Model</a>   |   📑 <a href="https://arxiv.org/pdf/2603.09677">Technical Report</a>   |   💻 <a href="https://github.com/alibaba/Logics-Parsing/tree/main/Logics-Parsing-Omni">GitHub</a>
</p>
**OmniParsingBench** is a comprehensive, large-scale, high-quality benchmark corpus for rigorously evaluating the unified parsing capabilities of Multimodal Large Language Models (MLLMs) across diverse modalities.
Unlike traditional single-task benchmarks, OmniParsingBench assesses the full spectrum of parsing performance—from fundamental signal detection to complex semantic reasoning—across six primary domains: **Document, Natural Image, Graphics, Audio, Natural Video, and Text-Rich Video**.
## 📖 Evaluation Framework & Metrics
Our evaluation framework strictly aligns with a proposed three-stage architecture, systematically assessing performance across different cognitive levels:
- **L1 - Holistic Detection:** Spatio-temporal grounding and classification.
- **L2 - Fine-grained Recognition:** Symbol extraction, attribute identification, and structural recovery.
- **L3 - Multi-level Interpreting:** Semantic consistency and hallucination resistance.
To provide a concise view of model capabilities, we aggregate these fine-grained metrics into two core scores, alongside an overall metric:
* **Perception (Perc.):** Evaluates signal precision and structural fidelity (drawn primarily from L1 and L2).
* **Cognition (Cog.):** Evaluates logical reasoning and semantic understanding (drawn primarily from L3).
* **Overall (Ovr.):** The comprehensive performance metric across all levels.
## 🏆 Leaderboard
### Overall Performance
<div align="center">
<table>
<thead>
<tr>
<th rowspan="2" align="left" valign="middle">Model</th>
<th colspan="3" align="center">Natural Image</th>
<th colspan="3" align="center">Graphics</th>
<th colspan="1" align="center">Document</th>
<th colspan="3" align="center">Audio</th>
<th colspan="3" align="center">Natural Video</th>
<th colspan="3" align="center">Text-Rich Video</th>
</tr>
<tr>
<th align="center">Ovr.</th>
<th align="center">Perc.</th>
<th align="center">Cog.</th>
<th align="center">Ovr.</th>
<th align="center">Perc.</th>
<th align="center">Cog.</th>
<th align="center">Perc.</th>
<th align="center">Ovr.</th>
<th align="center">Perc.</th>
<th align="center">Cog.</th>
<th align="center">Ovr.</th>
<th align="center">Perc.</th>
<th align="center">Cog.</th>
<th align="center">Ovr.</th>
<th align="center">Perc.</th>
<th align="center">Cog.</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Gemini-3-Pro</td>
<td align="center"><b>61.20</b></td>
<td align="center">55.96</td>
<td align="center"><b>66.44</b></td>
<td align="center"><u>87.03</u></td>
<td align="center"><b>84.21</b></td>
<td align="center">87.43</td>
<td align="center"><b>87.01</b></td>
<td align="center"><u>79.40</u></td>
<td align="center"><b>72.90</b></td>
<td align="center">85.89</td>
<td align="center"><b>63.40</b></td>
<td align="center"><b>57.87</b></td>
<td align="center"><b>68.92</b></td>
<td align="center"><u>64.37</u></td>
<td align="center"><b>58.54</b></td>
<td align="center"><u>70.20</u></td>
</tr>
<tr>
<td align="left">GPT-5.2</td>
<td align="center">39.94</td>
<td align="center">37.77</td>
<td align="center">42.12</td>
<td align="center">82.71</td>
<td align="center">69.86</td>
<td align="center"><u>91.48</u></td>
<td align="center">77.43</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
</tr>
<tr>
<td align="left">Qwen3.5-397B-A17B</td>
<td align="center">57.40</td>
<td align="center"><b>56.95</b></td>
<td align="center">57.85</td>
<td align="center">82.81</td>
<td align="center">73.77</td>
<td align="center">83.13</td>
<td align="center">81.09</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
</tr>
<tr>
<td align="left">Qwen3-VL-235B-A22B</td>
<td align="center">58.61</td>
<td align="center"><u>56.23</u></td>
<td align="center">60.99</td>
<td align="center">79.49</td>
<td align="center">71.51</td>
<td align="center">83.46</td>
<td align="center">84.47</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
</tr>
<tr>
<td align="left">Qwen3-VL-30B-A3B</td>
<td align="center">50.92</td>
<td align="center">48.91</td>
<td align="center">52.94</td>
<td align="center">73.25</td>
<td align="center">65.71</td>
<td align="center">79.36</td>
<td align="center">78.94</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
<td align="center">--</td>
</tr>
<tr>
<td align="left">Qwen3-Omni-30B-A3B</td>
<td align="center">47.36</td>
<td align="center">46.85</td>
<td align="center">47.88</td>
<td align="center">77.46</td>
<td align="center">70.75</td>
<td align="center">78.25</td>
<td align="center">73.50</td>
<td align="center">75.17</td>
<td align="center">62.13</td>
<td align="center"><u>88.22</u></td>
<td align="center">45.23</td>
<td align="center">34.15</td>
<td align="center">56.32</td>
<td align="center">26.86</td>
<td align="center">10.22</td>
<td align="center">43.50</td>
</tr>
<tr>
<td align="left"><b>Logics-Parsing-Omni (Ours)</b></td>
<td align="center"><u>59.07</u></td>
<td align="center">53.77</td>
<td align="center"><u>64.37</u></td>
<td align="center"><b>88.66</b></td>
<td align="center"><u>82.01</u></td>
<td align="center"><b>92.12</b></td>
<td align="center"><u>84.90</u></td>
<td align="center"><b>79.63</b></td>
<td align="center"><u>69.27</u></td>
<td align="center"><b>89.99</b></td>
<td align="center"><u>61.12</u></td>
<td align="center"><u>56.09</u></td>
<td align="center"><u>66.15</u></td>
<td align="center"><b>69.12</b></td>
<td align="center"><u>57.39</u></td>
<td align="center"><b>80.85</b></td>
</tr>
</tbody>
</table>
<p align="left"><em>Note: <b>Bold text</b> indicates the best result, and <u>underlined text</u> indicates the second-best result.</em></p>
</div>
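
The card does not state how *Overall* is derived from *Perception* and *Cognition*. As a quick consistency check, the sketch below tests the assumption that Overall is the unweighted mean of the two. The rows are copied from the leaderboard above; the function names and the 0.01 rounding tolerance are illustrative choices, not part of the benchmark.

```python
# Rows copied from the leaderboard: (model, domain, Ovr., Perc., Cog.).
# Document is omitted because it reports Perception only.
ROWS = [
    ("Gemini-3-Pro", "Natural Image", 61.20, 55.96, 66.44),
    ("Gemini-3-Pro", "Graphics", 87.03, 84.21, 87.43),
    ("Gemini-3-Pro", "Audio", 79.40, 72.90, 85.89),
    ("Gemini-3-Pro", "Natural Video", 63.40, 57.87, 68.92),
    ("Gemini-3-Pro", "Text-Rich Video", 64.37, 58.54, 70.20),
    ("Logics-Parsing-Omni", "Natural Image", 59.07, 53.77, 64.37),
    ("Logics-Parsing-Omni", "Graphics", 88.66, 82.01, 92.12),
    ("Logics-Parsing-Omni", "Audio", 79.63, 69.27, 89.99),
    ("Logics-Parsing-Omni", "Natural Video", 61.12, 56.09, 66.15),
    ("Logics-Parsing-Omni", "Text-Rich Video", 69.12, 57.39, 80.85),
]


def mean_matches(ovr, perc, cog, tol=0.01):
    """True if Ovr. equals the mean of Perc. and Cog. up to rounding."""
    return abs((perc + cog) / 2 - ovr) <= tol


def domains_matching_mean(rows):
    """Map each domain to True only if every row in it passes the mean check."""
    result = {}
    for _model, domain, ovr, perc, cog in rows:
        result[domain] = result.get(domain, True) and mean_matches(ovr, perc, cog)
    return result


for domain, ok in domains_matching_mean(ROWS).items():
    print(f"{domain}: {'consistent with mean' if ok else 'not a simple mean'}")
```

On these rows the simple-mean assumption holds for Natural Image, Audio, Natural Video, and Text-Rich Video, but not for Graphics, so the Graphics *Overall* score is presumably aggregated in some other way that the card does not describe.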
### 📊 Results Analysis
As detailed in the table above, **Logics-Parsing-Omni** demonstrates highly competitive or state-of-the-art capabilities across all six diverse modalities:
* **Dominance in Complex Modalities:** Our model consistently surpasses all evaluated baselines—including the leading proprietary **Gemini-3-Pro**—in the *Overall* and *Cognition* metrics of the *Graphics, Audio, and Text-Rich Video* domains.
* **Exceptional Semantic Understanding:** The lead is most pronounced in the **Cognition** metric: Logics-Parsing-Omni scores **92.12** in Graphics (0.64 points above GPT-5.2, the next best) and **80.85** in Text-Rich Video (10.65 points above Gemini-3-Pro, the next best).
* **Leading Open-Weight Performance:** Gemini-3-Pro keeps an advantage over our model in *Perception* across all six domains, as well as a marginal *Overall* lead in Natural Video. Against the other open-weight models (e.g., the Qwen series), our model scores higher on every reported metric except Natural Image *Perception*, where Qwen3.5-397B-A17B (56.95) and Qwen3-VL-235B-A22B (56.23) exceed our 53.77.
These quantitative results validate the efficacy of our L1–L3 architecture, demonstrating that Logics-Parsing-Omni successfully bridges fundamental signal detection with complex multi-modal interpreting.
## 📊 Dataset Overview
| Split | Modality | Source | Size |
|-------|----------|--------|------|
| `natural_image` | Image | [Pexels](https://www.pexels.com), [Wikimedia Commons](https://commons.wikimedia.org) | 1,000 |
| `graphics` | Image | Synthesized (charts & geometric figures) | 1,000 |
| `audio` | Audio | [YouTube](https://www.youtube.com) | 1,014 |
| `natural_video` | Video | [YouTube](https://www.youtube.com) | 1,121 |
| `textrich_video` | Video | [YouTube](https://www.youtube.com) | 259 |
| `document` | Document | [SkylenAge](https://skylenage.alibabagroup.com/sla/evaluation/detail?id=OFW6tlGUt2F4merPuEF26) | 900 pages |
| **Total** | | | **5,294** |
> **Data Fields:** `ID` · `URL` · `Start_time/End_time` · `Cognition` · `Perception` · `Split`
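
The YAML header maps each split to a JSON Lines file (for example `data/audio.jsonl`). The sketch below shows one way to read such a file and check for the listed fields. The card does not include a sample record, so the record here is hypothetical, and treating `Start_time/End_time` as two separate keys is an assumption.

```python
import io
import json

# Field names from the "Data Fields" line above. Splitting
# "Start_time/End_time" into two keys is an assumption.
EXPECTED_FIELDS = {
    "ID", "URL", "Start_time", "End_time", "Cognition", "Perception", "Split",
}


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_fields(record):
    """Return the expected field names that are absent from a record."""
    return EXPECTED_FIELDS - set(record)


# Hypothetical record: every value is a placeholder, not real benchmark data.
SAMPLE = io.StringIO(
    '{"ID": "demo-0001", "URL": "https://example.com/clip", '
    '"Start_time": 0, "End_time": 10, "Cognition": {}, "Perception": {}, '
    '"Split": "audio"}\n'
)

records = list(read_jsonl(SAMPLE))
for record in records:
    print(record["ID"], record["Split"], sorted(missing_fields(record)))
```

To read a real split, replace `SAMPLE` with an opened file such as `open("data/audio.jsonl", encoding="utf-8")`. With the Hugging Face `datasets` library installed, `load_dataset("Logics-MLLM/OmniParsingBench", split="audio")` should load the same split, given the `configs` block in the header. The `document` split is not listed in that block, so it is presumably obtained from the SkylenAge link in the table instead.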
## 📝 Citation
If you find OmniParsingBench or our model useful in your research, please consider citing our technical report:
```bibtex
@article{logicsparsingomni2026,
title={Logics-Parsing-Omni: Bridging Fine-Grained Perception and Semantic Cognition in Multimodal Parsing},
author={Logics Team},
journal={arXiv preprint arXiv:2603.09677},
year={2026}
}
```
Provided by: Logics-MLLM



