VideoThinkBench

Name: VideoThinkBench
Creator: maas
Published: 2026-01-06 16:51:26
License: No description available

ModelScope Community. Updated 2026-01-06; indexed 2025-11-15.

Download link:

https://modelscope.cn/datasets/openmoss/VideoThinkBench


Resource description:

<div align="center">

# Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

</div>

<div align="center" style="font-size: 15pt">

<a href='https://arxiv.org/abs/2511.04570'><img src='https://img.shields.io/badge/Arxiv-2511.04570-purple'></a>
<a href='https://huggingface.co/papers/2511.04570'><img src='https://img.shields.io/badge/HF%20Paper-2511.04570-blue'></a>
<a href='https://thinking-with-video.github.io/'><img src='https://img.shields.io/badge/Project-Website-green'></a>
<a href='https://github.com/tongjingqi/Thinking-with-Video'><img src='https://img.shields.io/badge/Code-GitHub-black'></a>
<a href='https://thinking-with-video.github.io/#leaderboard'><img src='https://img.shields.io/badge/Leaderboard-Table-E07A5F'></a>

</div>

<div align="center">
<a href="https://huggingface.co/papers/date/2025-11-07">
<img src="assets/huggingface_paper_gold_week.svg"/>
</a>
</div>

## 🎊 News

- [2025.11] Our paper "Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm" has been released on arXiv! 📄 [[Paper](https://arxiv.org/abs/2511.04570)] On HuggingFace, it has achieved "#1 Paper of the Day"!
- [2025.11] 🔥 We release *["minitest"](https://huggingface.co/datasets/OpenMOSS-Team/VideoThinkBench)* of our VideoThinkBench, including 500 test samples of vision-centric tasks and 250 test samples of text-centric tasks.
- [2025.12] 🔥 We release VideoThinkBench [Leaderboard](https://thinking-with-video.github.io/#leaderboard) that includes different models.

## 📜 Brief Introduction

Moving beyond the traditional paradigms of "Thinking with Text" (e.g., Chain-of-Thought) and "Thinking with Images", we propose "**Thinking with Video**", a new paradigm that unifies visual and textual reasoning through video generation models. It naturally enables human-like dynamic reasoning through video generation, such as **drawing and imagination**.

💡 **A New Unified Reasoning Paradigm**

"Thinking with Video" leverages video generation models to visualize dynamic processes, represent temporal evolution, and embed text within video frames. This approach achieves unified multimodal understanding and generation, overcoming the static constraints of image-based reasoning and the modality separation in traditional approaches.

📊 **VideoThinkBench: A Comprehensive Benchmark**

We developed VideoThinkBench, the first reasoning benchmark specifically designed for evaluating video generation models. It comprises vision-centric tasks (eyeballing puzzles, visual puzzles, ARC-AGI-2, mazes) that leverage dynamic visual reasoning, and text-centric tasks adapted from established benchmarks (MATH, GSM8K, MMLU, MMMU, etc.) that test text-based reasoning capabilities within generated videos.

🚀 **Surpassing VLMs on Several Tasks**

Our evaluation shows that Sora-2 demonstrates competitive reasoning capabilities across both categories. Notably, Sora-2 **surpasses state-of-the-art vision-language models on several vision-centric tasks**, showcasing the unique advantages of dynamic visual reasoning. On text-centric tasks, Sora-2 achieves strong performance including 98.9% on GSM8K, 94.0% on MATH, and 75.5% on MMMU, demonstrating the potential of "Thinking with Video" as a unified multimodal reasoning paradigm.

<div align="center">
<img src="assets/main_picture.png" width=80% />
</div>

## 📝 Paper Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation.

To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings suggest that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

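The self-consistency result mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a final answer has already been extracted from each of several independently generated videos; the extraction step and the helper name `majority_vote` are hypothetical and are not taken from the paper or its code. The sketch simply returns the most frequent answer.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most common non-None answer, or None if there is none.

    Hypothetical helper: each element stands for the final answer read from
    one generated video; None marks a video with no readable answer.
    Ties go to the answer that appeared first in the input.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # Counter keeps insertion order, so equal counts resolve to the earliest answer.
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    # Five sampled generations for one question; one has no readable answer.
    sampled = ["42", "41", "42", None, "42"]
    print(majority_vote(sampled))
```

Majority voting is only one possible aggregation rule; it is used here because it needs nothing beyond the extracted answers.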
## 📚 VideoThinkBench Details

VideoThinkBench is a comprehensive benchmark for evaluating video generation models' reasoning capabilities, consisting of two main categories:

### Vision-Centric Tasks

- **Eyeballing Puzzles**: Spatial reasoning tasks requiring visual estimation and drawing
- **Visual Puzzles**: Pattern recognition and visual logic problems
- **ARC-AGI-2**: Abstract reasoning tasks requiring few-shot learning
- **Mazes**: Path-finding and navigation challenges

### Text-Centric Tasks

Adapted from established benchmarks including:

- **Mathematical Reasoning**: MATH, GSM8K, AIME, MathVista, MathVision
- **Multimodal Understanding**: MMMU, MMBench
- **General Knowledge**: MMLU, MMLU-Pro
- **Scientific Reasoning**: GPQA-diamond, SuperGPQA

## ✨ Benchmark Results

### Performance Comparison Across All Tasks

The table below summarizes the accuracy (%) of Sora-2 compared with state-of-the-art vision-language models across all second-level tasks in VideoThinkBench:

| **Category** | **Task** | **Sora-2** | **Gemini 2.5 Pro** | **GPT5 high** | **Claude Sonnet 4.5** |
|--------------|----------|------------|-------------------|--------------|---------------------|
| **Vision-Centric** | Eyeballing-Point | 44.7 | 27.8 | 33.6 | 36.2 |
| | Eyeballing-Line | 38.0 | 21.0 | 24.0 | 26.3 |
| | Eyeballing-Shape | 34.5 | 34.5 | 32.5 | 50.5 |
| | Visual-Color | 67.0 | 73.9 | 79.6 | 85.6 |
| | Visual-Shape | 64.9 | 92.9 | 97.5 | 68.6 |
| | ARC-AGI-2 | 1.3 | 4.9 | 9.9 | 13.6 |
| | **Average** | **41.7** | **42.5** | **46.2** | **46.8** |
| **Text-Centric** | Text-Only Math | 53.6 | 94.8 | 97.2 | 90.0 |
| | Text-Only General Knowledge | 63.1 | 84.5 | 85.2 | 86.3 |
| | Multimodal Math | 56.3 | 66.7 | 69.6 | 65.6 |
| | Multimodal General Knowledge | 49.4 | 83.0 | 80.6 | 82.3 |
| | **Average** | **55.6** | **82.3** | **83.2** | **81.1** |
| **Overall Average** | | **47.3** | **58.4** | **61.0** | **60.5** |

**Note**: For Sora-2: Eyeballing Puzzles use Major Frame evaluation; Visual Puzzles show the average of Color-Filling and Shape-Drawing tasks; Text-Centric Reasoning tasks use Video evaluation results.

**🔥Leaderboard: [HERE](https://thinking-with-video.github.io/#leaderboard)**

## 💡 Takeaways

Our systematic evaluation on VideoThinkBench reveals seven key findings:

1. **Surpassing VLMs on Eyeballing Puzzles**: Sora-2 generally **surpasses SOTA VLMs** on eyeballing puzzles, exhibiting strong **geometric and physical reasoning** abilities. It can simulate the extension and reflection of rays and manipulate geometric elements (e.g., points and lines) to support spatial reasoning.

2. **Inductive Reasoning on Visual Puzzles**: Sora-2's performance is comparable to Claude Sonnet 4.5 on Shape-Drawing puzzles, demonstrating **inductive reasoning** capabilities. Sora-2 can recognize and apply **patterns of color, shape, and size**, solving visual puzzles involving symmetry, gradients, and compositionality.

3. **Few-Shot Learning Capabilities**: **Sora-2 is a few-shot learner**. On ARC-AGI-2, which requires finding patterns in input-output pairs, while SOTA VLMs achieve less than 5% accuracy, Sora-2 can often make **reasonable predictions**, although they do not strictly match dataset annotations.

4. **Unified Multimodal Reasoning**: On text-centric tasks, Sora-2 shows surprising performance on text and multimodal reasoning benchmarks. The video generation model can **embed text within video frames**, enabling unified multimodal understanding and generation. This demonstrates that "Thinking with Video" is potentially a **unified multimodal reasoning paradigm**.

5. **Improved In-Context Learning with More Examples**: Sora-2 achieves better in-context learning when given more examples. Experiments show that Sora-2 performs better when provided with all examples than with only one example, revealing an underexplored direction for analyzing and improving the in-context learning abilities of video generation models.

6. **Test-Time Scaling with Self-Consistency**: **Self-consistency can improve** Sora-2's performance on verifiable video generation reasoning tasks. This reveals an underexplored direction: **test-time scaling in video generation reasoning tasks**.

7. **Analysis of Capability Source**: We systematically analyzed the **source of Sora-2's capabilities**. Sora-2 maintains performance comparable to the original test set on adapted math problems, reducing the likelihood of test set leakage. However, Sora-2 struggles to generate coherent reasoning processes in videos, even when providing correct final answers. Through comparative experiments with Wan 2.5, we speculate that Sora-2's text-centric reasoning ability originates from its **prompt rewriter** model.

## ⚖️ Licenses

[![Code License](https://img.shields.io/badge/Code%20License-MIT-green.svg)](LICENSE)

This project is licensed under the MIT License - see the LICENSE file for details.

## 🔎 Citation

If you find our work helpful, please consider citing our paper 📝 and starring us ⭐️!

```bibtex
@article{tong2025thinking,
  title={Thinking with video: Video generation as a promising multimodal reasoning paradigm},
  author={Tong, Jingqi and Mou, Yurong and Li, Hangcheng and Li, Mingzhe and Yang, Yongzhuo and Zhang, Ming and Chen, Qiguang and Liang, Tianyi and Hu, Xiaomeng and Zheng, Yining and others},
  journal={arXiv preprint arXiv:2511.04570},
  year={2025}
}
```

---

<div align="center">
Made with ❤️ for advancing multimodal reasoning research
</div>
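
As a sanity check on the Benchmark Results table above, the reported averages can be re-derived from the per-task rows. The sketch below copies the Sora-2 column from the table and assumes (an inference from the numbers, not something the README states) that each category average is an unweighted mean over its tasks and that the overall average is an unweighted mean over all ten tasks. The variable and function names are illustrative.

```python
from statistics import fmean

# Sora-2 accuracies (%) copied from the Benchmark Results table.
VISION_CENTRIC = {
    "Eyeballing-Point": 44.7,
    "Eyeballing-Line": 38.0,
    "Eyeballing-Shape": 34.5,
    "Visual-Color": 67.0,
    "Visual-Shape": 64.9,
    "ARC-AGI-2": 1.3,
}
TEXT_CENTRIC = {
    "Text-Only Math": 53.6,
    "Text-Only General Knowledge": 63.1,
    "Multimodal Math": 56.3,
    "Multimodal General Knowledge": 49.4,
}


def unweighted_mean(scores: dict) -> float:
    """Plain arithmetic mean over the task scores (assumed aggregation)."""
    return fmean(scores.values())


vision_avg = unweighted_mean(VISION_CENTRIC)
text_avg = unweighted_mean(TEXT_CENTRIC)
# Assumption: the overall figure averages all ten tasks, not the two category means.
overall_avg = fmean([*VISION_CENTRIC.values(), *TEXT_CENTRIC.values()])

if __name__ == "__main__":
    print(f"{vision_avg:.1f} {text_avg:.1f} {overall_avg:.1f}")
```

Under those assumptions the script prints 41.7, 55.6 and 47.3, which agree with the Sora-2 averages reported in the table.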


Provider: maas
Created: 2025-11-07
