FysicsWorld
Source: ModelScope community | Updated: 2026-01-08 | Indexed: 2025-12-20
Download link:
https://modelscope.cn/datasets/Fysics-AI/FysicsWorld
<p align="center" width="100%">
<a target="_blank"><img src="figs/FysicsWorld-logo.png" alt="" style="width: 50%; min-width: 200px; display: block; margin: auto;"></a>
</p>
<div align="center">
<br>
<h1>FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning</h1>
<font size=3><div align='center' >
[[🏠 Project Page](https://github.com/Fysics-AI/FysicsWorld)]
[[📖 Paper](https://arxiv.org/pdf/2512.12756)]
[[🤗 Dataset](https://huggingface.co/datasets/Fysics-AI/FysicsWorld)]
[[👾 ModelScope](https://www.modelscope.cn/datasets/Fysics-AI/FysicsWorld)]
[[🏆 Leaderboard](https://huggingface.co/spaces/Fysics-AI/FysicsWorld-Leaderboard)]
[[🀄 中文版](README_zh.md)]
</div></font>
</div>
## 🚀 News
* **`2025.12.14`** We release [***FysicsWorld***](https://huggingface.co/datasets/Fysics-AI/FysicsWorld), the first unified full-modality benchmark that supports bidirectional input–output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning.
## 🎯 ***FysicsWorld*** Overview
<img src="figs/fig-teaser.jpg" width="100%" height="100%">
We introduce ***FysicsWorld***, the **first** unified full-modality benchmark that supports bidirectional input–output across *image, video, audio, and text*, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. Its systematic design ranges from uni-modal perception tasks to fusion-dependent reasoning under strong cross-modal coupling, which lets us diagnose the limitations and emerging strengths of modern multimodal and omni-modal architectures. Compared with existing omni-modal and multi-modal benchmarks, ***FysicsWorld*** has several advantages:
* **Diversity and High Quality**. ***FysicsWorld*** is characterized by **8 "*multi*"** properties, reflecting its comprehensive coverage, diversity, and robustness, namely:
*multi-dimensional* (understanding, generation, reasoning, voice interaction), *multi-modal* (text, image, video, and audio as both inputs and outputs), *multi-task* (16 primary tasks, 200+ sub-tasks), *multi-source* (3,268 samples from 40+ data sources and curated web data), *multi-domain* (170+ fine-grained open-domain categories), *multi-type* (closed-ended questions, open-ended questions, multiple-choice questions, and image/video/audio generation), *multi-target* (evaluates Omni-LLMs, MLLMs, modality-specific models, and unified understanding–generation models), and *multi-assurance* (multi-stage quality control strategies).
* **Fusion-Dependent Cross-Modal Reasoning**. We propose an omni-modal data construction method, the **C**ross-**M**odal **C**omplementarity **S**creening (**CMCS**) strategy, which ensures that our tasks maintain strong cross-modal coupling, preventing single-modality shortcuts and enforcing true synergistic perception across modalities.
* **Speech-Driven Cross-Modal Interaction**. To support natural, multimodal communication and interaction, we develop a speech-grounded multimodal data construction pipeline that ensures both linguistic fluency and semantic fidelity in voice-based interactions, including 10+ authentic voices and tones.
Based on ***FysicsWorld***, we extensively evaluate various advanced models, including Omni-LLMs, MLLMs, modality-specific models, and unified understanding–generation models. By establishing a unified benchmark and highlighting key capability gaps, FysicsWorld provides not only a foundation for evaluating emerging multimodal systems but also a roadmap for the next generation of full-modality architectures capable of genuinely holistic perception, reasoning, and interaction.
<p align="center">
<img src="figs/fig-statiscs.jpg" width="100%" height="100%">
</p>
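The README does not spell out how CMCS is implemented, so the following is only an assumed illustration of the idea of screening out single-modality shortcuts: a sample is kept when it can be answered with all modalities present but not from any one modality alone. The function names, the `inputs`/`answer` sample layout, and the toy solver are all hypothetical.

```python
def passes_complementarity_screen(sample, answer_fn):
    """Keep a sample only if it needs all of its modalities.

    `sample` holds a dict of modality inputs and a reference answer;
    `answer_fn(inputs)` is any hypothetical solver returning an answer.
    """
    inputs = sample["inputs"]
    reference = sample["answer"]

    # The sample must be solvable when every modality is available.
    if answer_fn(inputs) != reference:
        return False

    # Reject it if any single modality alone already yields the answer.
    for name, value in inputs.items():
        if answer_fn({name: value}) == reference:
            return False
    return True


def toy_solver(inputs):
    """Toy solver: answers 'dog barking' only when it sees both cues."""
    if inputs.get("image") == "dog" and inputs.get("audio") == "bark":
        return "dog barking"
    if inputs.get("image") == "cat":
        return "cat"  # answerable from the image alone
    return None


coupled = {"inputs": {"image": "dog", "audio": "bark"}, "answer": "dog barking"}
shortcut = {"inputs": {"image": "cat", "audio": "meow"}, "answer": "cat"}

print(passes_complementarity_screen(coupled, toy_solver))   # True
print(passes_complementarity_screen(shortcut, toy_solver))  # False
```

In practice `answer_fn` would be one or more real models rather than a rule, and the actual CMCS criteria may differ from this simplified filter.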
## 🔍 Dataset Download
The full dataset, including associated multimedia files (images, videos, and audio), can be downloaded from:
- Link-1 (🤗 HuggingFace): [[Link](https://huggingface.co/datasets/Fysics-AI/FysicsWorld)]
- Link-2 (🤗 HF-Mirror): [[Link](https://hf-mirror.com/datasets/Fysics-AI/FysicsWorld)]
- Link-3 (👾 ModelScope): [[Link](https://www.modelscope.cn/datasets/Fysics-AI/FysicsWorld)]
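For scripted downloads, a minimal sketch using the `huggingface_hub` client might look like the following. It assumes `huggingface_hub` is installed (`pip install huggingface_hub`); the helper names, the `local_dir` default, and the `subfolder` parameter are illustrative and not part of any official FysicsWorld tooling. Routing through the mirror by setting `HF_ENDPOINT` is an assumption based on common usage of that environment variable.

```python
import os

REPO_ID = "Fysics-AI/FysicsWorld"  # dataset repository named in the links above


def build_download_kwargs(local_dir="./FysicsWorld", subfolder=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download.

    `local_dir` and `subfolder` are illustrative parameters of this sketch.
    """
    kwargs = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",  # the repository is a dataset, not a model
        "local_dir": local_dir,
    }
    if subfolder is not None:
        # Restrict the download to one folder, e.g. "test-mini".
        kwargs["allow_patterns"] = [f"{subfolder}/*"]
    return kwargs


def download(local_dir="./FysicsWorld", subfolder=None, use_mirror=False):
    """Download the dataset (or one subfolder) and return the local path."""
    if use_mirror:
        # Must be set before huggingface_hub is first imported.
        os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(**build_download_kwargs(local_dir, subfolder))
```

Calling `download(subfolder="test-mini")` would fetch only that folder, which is usually enough for a first local check before pulling the full multimedia files.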
## 🔮 Evaluation
To ensure a fair and standardized evaluation protocol, we release the full ***FysicsWorld*** dataset with ground-truth answers withheld, along with a test-mini subset (300 samples) that includes answers for local validation and debugging. You can find the QA data in [./data](https://huggingface.co/datasets/Fysics-AI/FysicsWorld/tree/main/data) (full ***FysicsWorld***) and [./test-mini](https://huggingface.co/datasets/Fysics-AI/FysicsWorld/tree/main/test-mini) (test-mini), respectively.
🕹️ **Usage**:
1. Download the full FysicsWorld dataset from [here](https://huggingface.co/datasets/Fysics-AI/FysicsWorld).
2. Run inference using your model on the provided questions.
3. Follow the [guidelines](https://github.com/Fysics-AI/FysicsWorld/blob/main/eval/submission/EVALUATION.md), and format the model responses according to the required [submission format](https://github.com/Fysics-AI/FysicsWorld/blob/main/eval/submission/submission_format.json).
4. Send the formatted responses to *dicken@fyscis.ai*. We will periodically update the corresponding scores on the leaderboard.
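The workflow above can be rehearsed locally on the test-mini subset, which ships with answers. The sketch below assumes each sample is a JSON object with hypothetical `id`, `question`, and `answer` fields, and `run_model` is a placeholder for your own inference call. The real data schema and the required submission layout are defined by the repository files and the linked `submission_format.json`, so the field names here will likely need adapting.

```python
import json


def run_model(question):
    """Placeholder for real inference; replace with your model call."""
    return "A"


def predict_all(samples, model_fn=run_model):
    """Run the model on every sample and collect id -> response."""
    return {s["id"]: model_fn(s["question"]) for s in samples}


def normalize(text):
    """Lowercase and trim so 'A' and ' a ' compare equal."""
    return str(text).strip().lower()


def exact_match_accuracy(samples, predictions):
    """Fraction of answered samples whose prediction matches exactly.

    Only samples that carry an `answer` field are scored (test-mini style).
    """
    scored = [s for s in samples if s.get("answer") is not None]
    if not scored:
        return 0.0
    hits = sum(
        normalize(predictions.get(s["id"], "")) == normalize(s["answer"])
        for s in scored
    )
    return hits / len(scored)


def save_predictions(predictions, path):
    """Write predictions as JSON; reshape to the official format before submitting."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)


# Toy samples standing in for test-mini records (hypothetical schema).
toy_samples = [
    {"id": "q1", "question": "Which option ...?", "answer": "A"},
    {"id": "q2", "question": "Which option ...?", "answer": "B"},
]
toy_predictions = predict_all(toy_samples)
print(exact_match_accuracy(toy_samples, toy_predictions))  # 0.5
```

Exact matching is only a sanity check for closed-ended and multiple-choice items; open-ended and generation tasks need the scoring described in the official evaluation guidelines.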
## 📈 Experimental Results
- **Evaluation results of Omni-LLMs and proprietary MLLMs on image-centric omni-modal tasks.**
<p align="center">
<img src="figs/tab-image.png" width="90%" height="100%">
</p>
*Task abbreviations:*
Task1-1 (Image Understanding), Task2-1 (Speech-Driven Image Understanding), Task2-2 (Image–Audio Contextual Reasoning), Task2-3 (Speech-Based QA on Image Content), Task2-4 (Speech Generation from a Person in an Image), and Task2-5 (Audio Matching from Image Context).
- **Evaluation results of Omni-LLMs and proprietary MLLMs on video-centric omni-modal tasks.**
<p align="center">
<img src="figs/tab-video.png" width="90%" height="100%">
</p>
*Task abbreviations:*
Task1-2 (Video Understanding), Task3-1 (Speech-Driven Video Understanding), Task3-2 (Video–Audio Contextual Reasoning), Task3-3 (Speech-Based QA on Video Content), Task3-4 (Speech Generation from a Person in a Video), Task3-5 (Audio Matching from Video Context), and Task3-6 (Next-Action Prediction from Video Sequences and Current Visual State).
- **Evaluation results of open-source MLLMs on modality-supported tasks.**
<p align="center">
<img src="figs/fig-open-mllm.jpg" width="60%" height="100%">
</p>
*Task abbreviations:*
Task1-1 (Image Understanding), Task1-2 (Video Understanding), and Task3-6 (Next-Action Prediction from Video Sequences and Current Visual State).
- **Evaluation results of various models on (a) Audio Reasoning and (b) Video Generation.**
<p align="center">
<img src="figs/fig-exp-audio-video.jpg" width="90%" height="100%">
</p>
## 📖 Citation
If you find ***FysicsWorld*** helpful for your research, please consider citing our work. Thanks!
```bibtex
@article{jiang2025fysicsworld,
title={FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning},
author={Jiang, Yue and Yang, Dingkang and Han, Minghao and Han, Jinghang and Chen, Zizhi and Liu, Yizhou and Li, Mingcheng and Zhai, Peng and Zhang, Lihua},
journal={arXiv preprint arXiv:2512.12756},
year={2025}
}
```
Provider: maas
Created: 2025-12-20



