
Logics-MLLM/OmniParsingBench

Hugging Face · Updated 2026-04-08 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/Logics-MLLM/OmniParsingBench
Resource description:

---
license: apache-2.0
language:
- zh
- en
pretty_name: OmniParsingBench
tags:
- parsing
- chart
- document
- multimodal
configs:
- config_name: default
  data_files:
  - split: natural_image
    path: data/natural_image.jsonl
  - split: graphics
    path: data/graphics.jsonl
  - split: audio
    path: data/audio.jsonl
  - split: natural_video
    path: data/natural_video.jsonl
  - split: textrich_video
    path: data/textrich_video.jsonl
---

<div align="center">
  <img src="logo.png" width="80%">
</div>

<p align="center">
  🤗 <a href="https://huggingface.co/Logics-MLLM/Logics-Parsing-Omni">Model</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/pdf/2603.09677">Technical Report</a>&nbsp;&nbsp; | &nbsp;&nbsp;💻 <a href="https://github.com/alibaba/Logics-Parsing/tree/main/Logics-Parsing-Omni">GitHub</a>
</p>

**OmniParsingBench** is a comprehensive, large-scale, high-quality evaluation corpus for rigorously assessing the unified parsing capabilities of Multimodal Large Language Models (MLLMs) across diverse modalities. Unlike traditional single-task benchmarks, OmniParsingBench covers the full spectrum of parsing performance, from fundamental signal detection to complex semantic reasoning, across six primary domains: **Document, Natural Image, Graphics, Audio, Natural Video, and Text-Rich Video**.

## 📖 Evaluation Framework & Metrics

Our evaluation framework follows the proposed three-stage architecture and assesses performance at three cognitive levels:

- **L1 - Holistic Detection:** Spatio-temporal grounding and classification.
- **L2 - Fine-grained Recognition:** Symbol extraction, attribute identification, and structural recovery.
- **L3 - Multi-level Interpreting:** Semantic consistency and hallucination resistance.

To give a concise view of model capabilities, we aggregate these fine-grained metrics into two core scores plus an overall metric:

* **Perception (Perc.):** Evaluates signal precision and structural fidelity (drawn mainly from L1 and L2).
* **Cognition (Cog.):** Evaluates logical reasoning and semantic understanding (drawn mainly from L3).
* **Overall (Ovr.):** The comprehensive performance metric across all levels.

## 🏆 Leaderboard

### Overall Performance

<div align="center">
<table>
  <thead>
    <tr>
      <th rowspan="2" align="left" valign="middle">Model</th>
      <th colspan="3" align="center">Natural Image</th>
      <th colspan="3" align="center">Graphics</th>
      <th colspan="1" align="center">Document</th>
      <th colspan="3" align="center">Audio</th>
      <th colspan="3" align="center">Natural Video</th>
      <th colspan="3" align="center">Text-Rich Video</th>
    </tr>
    <tr>
      <th align="center">Ovr.</th> <th align="center">Perc.</th> <th align="center">Cog.</th>
      <th align="center">Ovr.</th> <th align="center">Perc.</th> <th align="center">Cog.</th>
      <th align="center">Perc.</th>
      <th align="center">Ovr.</th> <th align="center">Perc.</th> <th align="center">Cog.</th>
      <th align="center">Ovr.</th> <th align="center">Perc.</th> <th align="center">Cog.</th>
      <th align="center">Ovr.</th> <th align="center">Perc.</th> <th align="center">Cog.</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="left">Gemini-3-Pro</td>
      <td align="center"><b>61.20</b></td> <td align="center">55.96</td> <td align="center"><b>66.44</b></td>
      <td align="center"><u>87.03</u></td> <td align="center"><b>84.21</b></td> <td align="center">87.43</td>
      <td align="center"><b>87.01</b></td>
      <td align="center"><u>79.40</u></td> <td align="center"><b>72.90</b></td> <td align="center">85.89</td>
      <td align="center"><b>63.40</b></td> <td align="center"><b>57.87</b></td> <td align="center"><b>68.92</b></td>
      <td align="center"><u>64.37</u></td> <td align="center"><b>58.54</b></td> <td align="center"><u>70.20</u></td>
    </tr>
    <tr>
      <td align="left">GPT-5.2</td>
      <td align="center">39.94</td> <td align="center">37.77</td> <td align="center">42.12</td>
      <td align="center">82.71</td> <td align="center">69.86</td> <td align="center"><u>91.48</u></td>
      <td align="center">77.43</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
    </tr>
    <tr>
      <td align="left">Qwen3.5-397B-A17B</td>
      <td align="center">57.40</td> <td align="center"><b>56.95</b></td> <td align="center">57.85</td>
      <td align="center">82.81</td> <td align="center">73.77</td> <td align="center">83.13</td>
      <td align="center">81.09</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
    </tr>
    <tr>
      <td align="left">Qwen3-VL-235B-A22B</td>
      <td align="center">58.61</td> <td align="center"><u>56.23</u></td> <td align="center">60.99</td>
      <td align="center">79.49</td> <td align="center">71.51</td> <td align="center">83.46</td>
      <td align="center">84.47</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
    </tr>
    <tr>
      <td align="left">Qwen3-VL-30B-A3B</td>
      <td align="center">50.92</td> <td align="center">48.91</td> <td align="center">52.94</td>
      <td align="center">73.25</td> <td align="center">65.71</td> <td align="center">79.36</td>
      <td align="center">78.94</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
      <td align="center">--</td> <td align="center">--</td> <td align="center">--</td>
    </tr>
    <tr>
      <td align="left">Qwen3-Omni-30B-A3B</td>
      <td align="center">47.36</td> <td align="center">46.85</td> <td align="center">47.88</td>
      <td align="center">77.46</td> <td align="center">70.75</td> <td align="center">78.25</td>
      <td align="center">73.50</td>
      <td align="center">75.17</td> <td align="center">62.13</td> <td align="center"><u>88.22</u></td>
      <td align="center">45.23</td> <td align="center">34.15</td> <td align="center">56.32</td>
      <td align="center">26.86</td> <td align="center">10.22</td> <td align="center">43.50</td>
    </tr>
    <tr>
      <td align="left"><b>Logics-Parsing-Omni (Ours)</b></td>
      <td align="center"><u>59.07</u></td> <td align="center">53.77</td> <td align="center"><u>64.37</u></td>
      <td align="center"><b>88.66</b></td> <td align="center"><u>82.01</u></td> <td align="center"><b>92.12</b></td>
      <td align="center"><u>84.90</u></td>
      <td align="center"><b>79.63</b></td> <td align="center"><u>69.27</u></td> <td align="center"><b>89.99</b></td>
      <td align="center"><u>61.12</u></td> <td align="center"><u>56.09</u></td> <td align="center"><u>66.15</u></td>
      <td align="center"><b>69.12</b></td> <td align="center"><u>57.39</u></td> <td align="center"><b>80.85</b></td>
    </tr>
  </tbody>
</table>
<p align="left"><em>Note: <b>Bold text</b> indicates the best result, <u>underlined text</u> indicates the second-best result, and -- indicates that no result is reported.</em></p>
</div>

### 📊 Results Analysis

As detailed in the table above, **Logics-Parsing-Omni** delivers highly competitive or state-of-the-art results across all six modalities:

* **Dominance in Complex Modalities:** In the *Graphics, Audio, and Text-Rich Video* domains, our model surpasses every baseline evaluated there, including the leading proprietary **Gemini-3-Pro**, on both the *Overall* and *Cognition* metrics.
* **Exceptional Semantic Understanding:** The advantage is most pronounced on the **Cognition** metric, which measures logical reasoning and semantic understanding: Logics-Parsing-Omni reaches **92.12** in Graphics and **80.85** in Text-Rich Video, the latter 10.65 points above Gemini-3-Pro.
* **Leading Open-Weight Performance:** Gemini-3-Pro still scores higher than our model on *Perception* for Natural Image, Graphics, Document, Audio, and Text-Rich Video, and holds a narrow lead on all three Natural Video metrics. Among open-weight models, however, Logics-Parsing-Omni outperforms its counterparts (e.g., the Qwen series) on nearly all metrics.

These quantitative results validate the efficacy of our L1–L3 architecture, demonstrating that Logics-Parsing-Omni successfully bridges fundamental signal detection with complex multimodal interpreting.

## 📊 Dataset Overview

| Split | Modality | Source | Size |
|-------|----------|--------|------|
| `natural_image` | Image | [Pexels](https://www.pexels.com), [Wikimedia Commons](https://commons.wikimedia.org) | 1,000 |
| `graphics` | Image | Synthesized (charts & geometric figures) | 1,000 |
| `audio` | Audio | [YouTube](https://www.youtube.com) | 1,014 |
| `natural_video` | Video | [YouTube](https://www.youtube.com) | 1,121 |
| `textrich_video` | Video | [YouTube](https://www.youtube.com) | 259 |
| `document` | Document | [SkylenAge](https://skylenage.alibabagroup.com/sla/evaluation/detail?id=OFW6tlGUt2F4merPuEF26) | 900 pages |
| **Total** | | | **5,294** |

> **Data Fields:** `ID` · `URL` · `Start_time/End_time` · `Cognition` · `Perception` · `Split`

## 📝 Citation

If you find OmniParsingBench or our model useful in your research, please consider citing our technical report:

```bibtex
@article{logicsparsingomni2026,
  title={Logics-Parsing-Omni: Bridging Fine-Grained Perception and Semantic Cognition in Multimodal Parsing},
  author={Logics Team},
  journal={arXiv preprint arXiv:2603.09677},
  year={2026}
}
```
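As a minimal sketch of how the five JSONL splits declared under `configs` in the card's YAML header might be read from a local copy, the following uses only the Python standard library. It assumes standard JSON Lines (one JSON object per line). The helper `load_split` is hypothetical, and because the card lists the data fields (`ID`, `URL`, `Start_time/End_time`, `Cognition`, `Perception`, `Split`) without their types, records are returned as plain dictionaries. The `document` split has no entry in `configs`, so it is not covered here.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

# Split-to-file mapping copied from the card's `configs` section.
SPLIT_FILES = {
    "natural_image": "data/natural_image.jsonl",
    "graphics": "data/graphics.jsonl",
    "audio": "data/audio.jsonl",
    "natural_video": "data/natural_video.jsonl",
    "textrich_video": "data/textrich_video.jsonl",
}


def load_split(root: str | Path, split: str) -> list[dict[str, Any]]:
    """Read one split as a list of records (hypothetical helper).

    Assumes JSON Lines: one JSON object per non-blank line.
    """
    if split not in SPLIT_FILES:
        raise KeyError(f"unknown split {split!r}; expected one of {sorted(SPLIT_FILES)}")
    path = Path(root) / SPLIT_FILES[split]
    records: list[dict[str, Any]] = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                records.append(json.loads(line))
    return records
```

With the Hugging Face `datasets` library installed, `load_dataset("Logics-MLLM/OmniParsingBench", split="audio")` should resolve the same split-to-file mapping from the card's `configs` block without a manual download.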
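In the Overall Performance table, every Overall score for Natural Image, Audio, Natural Video, and Text-Rich Video equals the unweighted mean of the corresponding Perception and Cognition scores, up to two-decimal rounding. The card does not state this rule, and the Graphics column does not follow it (Gemini-3-Pro's Graphics Perception of 84.21 and Cognition of 87.43 average to 85.82, while its reported Overall is 87.03). The sketch below is therefore only a consistency check under that assumption, not the authors' aggregation code; `overall_from_parts` and `matches_mean` are hypothetical names, and the scores are copied from the table.

```python
# Scores copied from the Overall Performance table: (Overall, Perception, Cognition).
SCORES = {
    ("Gemini-3-Pro", "Natural Image"): (61.20, 55.96, 66.44),
    ("Gemini-3-Pro", "Graphics"): (87.03, 84.21, 87.43),
    ("Gemini-3-Pro", "Audio"): (79.40, 72.90, 85.89),
    ("Qwen3-Omni-30B-A3B", "Natural Video"): (45.23, 34.15, 56.32),
    ("Logics-Parsing-Omni", "Graphics"): (88.66, 82.01, 92.12),
    ("Logics-Parsing-Omni", "Text-Rich Video"): (69.12, 57.39, 80.85),
}


def overall_from_parts(perception: float, cognition: float) -> float:
    """Assumed aggregation rule: unweighted mean of Perception and Cognition."""
    return (perception + cognition) / 2


def matches_mean(overall: float, perception: float, cognition: float, tol: float = 0.01) -> bool:
    """True if the reported Overall agrees with the mean, allowing for 2-decimal rounding."""
    return abs(overall_from_parts(perception, cognition) - overall) <= tol


if __name__ == "__main__":
    for (model, domain), (ovr, perc, cog) in SCORES.items():
        mean = overall_from_parts(perc, cog)
        verdict = "consistent" if matches_mean(ovr, perc, cog) else "NOT the mean"
        print(f"{model:<20} {domain:<16} reported={ovr:6.2f} mean={mean:6.2f} -> {verdict}")
```

Running it flags only the two Graphics rows, which suggests that the Graphics Overall score is aggregated differently; the card does not say how.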
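The note under the leaderboard defines the highlighting rule: bold marks the best score in a column and underline the second-best. A minimal sketch of that ranking follows, assuming higher is better and skipping models without a reported score (`--`, represented as `None`); how ties are broken is not stated in the card, so the sketch keeps input order. `rank_top2` is a hypothetical name, and the two sample columns are copied from the table.

```python
from __future__ import annotations

from typing import Optional

# Two columns copied from the Overall Performance table; "--" is represented as None.
TEXTRICH_VIDEO_COG: dict[str, Optional[float]] = {
    "Gemini-3-Pro": 70.20,
    "GPT-5.2": None,
    "Qwen3.5-397B-A17B": None,
    "Qwen3-VL-235B-A22B": None,
    "Qwen3-VL-30B-A3B": None,
    "Qwen3-Omni-30B-A3B": 43.50,
    "Logics-Parsing-Omni": 80.85,
}
NATURAL_IMAGE_PERC: dict[str, Optional[float]] = {
    "Gemini-3-Pro": 55.96,
    "GPT-5.2": 37.77,
    "Qwen3.5-397B-A17B": 56.95,
    "Qwen3-VL-235B-A22B": 56.23,
    "Qwen3-VL-30B-A3B": 48.91,
    "Qwen3-Omni-30B-A3B": 46.85,
    "Logics-Parsing-Omni": 53.77,
}


def rank_top2(column: dict[str, Optional[float]]) -> tuple[str, str]:
    """Return (best, second_best) model names for one leaderboard column."""
    scored = [(name, score) for name, score in column.items() if score is not None]
    if len(scored) < 2:
        raise ValueError("need at least two scored models to pick best and second-best")
    # sorted() is stable even with reverse=True, so tied scores keep their input order.
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return ordered[0][0], ordered[1][0]


if __name__ == "__main__":
    for label, column in [("Text-Rich Video Cog.", TEXTRICH_VIDEO_COG),
                          ("Natural Image Perc.", NATURAL_IMAGE_PERC)]:
        best, second = rank_top2(column)
        print(f"{label}: bold={best}, underline={second}")
```

For these two columns the result matches the table's markup: Logics-Parsing-Omni and Gemini-3-Pro for Text-Rich Video Cognition, and Qwen3.5-397B-A17B and Qwen3-VL-235B-A22B for Natural Image Perception.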
Provided by: Logics-MLLM