SpokenVisIT
Source: ModelScope community. Updated 2025-12-04. Indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/ICTNLP/SpokenVisIT
Description:
SpokenVisIT
SpokenVisIT is a real-world visual-speech interaction benchmark built upon VisIT-Bench, designed to evaluate the visually grounded speech interaction capabilities of omni large multimodal models (LMMs).
Our deepest acknowledgment goes to [VisIT-Bench](https://huggingface.co/datasets/mlfoundations/VisIT-Bench) (A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use), which collects a diverse set of real-world visual instructions. SpokenVisIT builds on this foundation by converting the textual instructions into spoken language, enabling the assessment of LMMs' capabilities in spoken interaction. **Please use SpokenVisIT under the license terms of VisIT-Bench.**
For more information on VisIT-Bench, please refer to the [paper](https://arxiv.org/abs/2308.06595), [blog](https://visit-bench.github.io/), and [code](https://github.com/mlfoundations/VisIT-Bench/).
For more information on SpokenVisIT, please refer to the [paper]() and [GitHub repo](https://github.com/ictnlp/Stream-Omni) of Stream-Omni.
Provider:
maas
Created:
2025-06-19



