SpokenVisIT

Name: SpokenVisIT
Creator: maas
Published: 2025-12-04 16:38:39
License: No description provided

ModelScope community. Updated: 2025-12-04. Indexed: 2025-06-21.

Download link:

https://modelscope.cn/datasets/ICTNLP/SpokenVisIT

Description:

SpokenVisIT is a real-world visual-speech interaction benchmark built upon VisIT-Bench, designed to evaluate the visually grounded speech interaction capabilities of omni large multimodal models (LMMs). Our deepest acknowledgment goes to [VisIT-Bench](https://huggingface.co/datasets/mlfoundations/VisIT-Bench) (A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use), which collects a diverse set of real-world visual instructions. SpokenVisIT builds on this foundation by converting the textual instructions into speech, enabling the assessment of LMMs' spoken interaction capabilities. **Please use SpokenVisIT under the license terms of VisIT-Bench.** For more information on VisIT-Bench, please refer to the [paper](https://arxiv.org/abs/2308.06595), [blog](https://visit-bench.github.io/), and [code](https://github.com/mlfoundations/VisIT-Bench/). For more information on SpokenVisIT, please refer to the [paper]() and [GitHub repo](https://github.com/ictnlp/Stream-Omni) of Stream-Omni.

Provider:

maas

Created:

2025-06-19
