WildSpeech-Bench

Name: WildSpeech-Bench
Creator: maas
Published: 2025-12-05 16:51:11
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-10-04

Download link:

https://modelscope.cn/datasets/tencent-community/WildSpeech-Bench

Description:

<h2 align="center" style="font-size: 2.5em; font-weight: bold; color: #2c3e50;"> WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild </h2>

<p align="center">
<a href="https://huggingface.co/datasets/tencent/WildSpeech-Bench" style="margin: 0 10px;">🤗 Dataset</a> |
<a href="https://github.com/Tencent/WildSpeech-Bench" style="margin: 0 10px;">🐙 GitHub</a> |
<a href="https://arxiv.org/abs/2506.21875" style="margin: 0 10px;">📖 arXiv</a>
</p>

This repository contains the evaluation code for the paper "[WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild](https://arxiv.org/abs/2506.21875)".

---

## 🔔 Introduction

<p align="center">
<img src="assets/wildspeech.jpg" alt="WildSpeech Overview" style="width: 800px;">
</p>

**WildSpeech-Bench** is the first benchmark for evaluating the **speech-to-speech** (S2S) capabilities of SpeechLLMs, characterized by both its evaluation framework and its construction process.

## 🪝 Construction

<p align="center">
<img src="assets/wildspeech_construction.jpg" alt="WildSpeech Overview" style="width: 800px;">
</p>

Our benchmark construction process directly counters the limitations of current datasets, resulting in a curated collection of 1,100 queries organized into five major categories. Each category reflects a common user intent, facilitating granular analysis and ensuring comprehensive coverage of real-world demands on SpeechLLMs. The process involves not only meticulous filtering for queries characteristic of spoken interaction but also a crucial subsequent phase of manual auditing, in which **every selected query was validated by human experts** to ensure its quality and relevance.

Our evaluation framework significantly improves the precision of LLM-based judging for S2S interactions. Moving beyond generic rubrics that often overlook critical nuances, we employ unique evaluation prompts for challenging queries. Crucially, these are not generic templates but **meticulously hand-crafted checklists**, each manually authored and fine-tuned by our team to highlight a specific query's characteristics and potential pitfalls.

## 🏆 Main Results

Main evaluation results. TC, II, SR, OE, and PF stand for Text Creation, Information Inquiry, Solution Request, Opinion Exchange, and Paralinguistic-Featured query, respectively.

| Model          | TC   | II   | SR   | OE   | PF   | Avg. |
|----------------|------|------|------|------|------|------|
| Naive Pipeline | 5.55 | 4.98 | 5.51 | 5.18 | 4.84 | 5.24 |
| Kimi-Audio     | 4.45 | 4.33 | 4.79 | 4.70 | 4.92 | 4.54 |
| GLM-4-Voice    | 5.16 | 4.77 | 5.41 | 5.04 | 4.51 | 5.03 |
| MiniCPM        | 5.17 | 4.89 | 5.28 | 5.31 | 4.78 | 5.08 |
| Qwen-2.5-omni  | 5.98 | 5.84 | 6.66 | 6.16 | 4.46 | 6.01 |
| GPT-4o-Audio   | 6.74 | 6.06 | 6.39 | 6.32 | 6.01 | 6.29 |

## 🔦 Citation

```bibtex
@misc{zhang2025wildspeechbenchbenchmarkingendtoendspeechllms,
  title={WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild},
  author={Linhao Zhang and Jian Zhang and Bokai Lei and Chuhan Wu and Aiwei Liu and Wei Jia and Xiao Zhou},
  year={2025},
  eprint={2506.21875},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}
```

## 📜 License

See the [License.txt](./License.txt) file for details.
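
## 🧮 Reading the Results Table

The main results table can be explored with a few lines of Python. The sketch below is not part of the official evaluation code: the scores are copied by hand from the table, and the names `CATEGORIES`, `RESULTS`, `best_per_category`, and `unweighted_mean` are hypothetical helpers introduced only for illustration. It finds the top-scoring model in each category and compares the reported Avg. with the plain mean of the five category scores. The two differ for every model, which may mean the reported average is weighted, for example by the number of queries per category; the table itself does not say, so treat that reading as an assumption.

```python
# Scores copied by hand from the main results table.
# Column order: TC, II, SR, OE, PF; the second tuple element is the reported Avg.
CATEGORIES = ["TC", "II", "SR", "OE", "PF"]

RESULTS = {
    "Naive Pipeline": ([5.55, 4.98, 5.51, 5.18, 4.84], 5.24),
    "Kimi-Audio":     ([4.45, 4.33, 4.79, 4.70, 4.92], 4.54),
    "GLM-4-Voice":    ([5.16, 4.77, 5.41, 5.04, 4.51], 5.03),
    "MiniCPM":        ([5.17, 4.89, 5.28, 5.31, 4.78], 5.08),
    "Qwen-2.5-omni":  ([5.98, 5.84, 6.66, 6.16, 4.46], 6.01),
    "GPT-4o-Audio":   ([6.74, 6.06, 6.39, 6.32, 6.01], 6.29),
}


def best_per_category(results):
    """Return {category: (model, score)} for the top-scoring model per category."""
    best = {}
    for idx, category in enumerate(CATEGORIES):
        model = max(results, key=lambda name: results[name][0][idx])
        best[category] = (model, results[model][0][idx])
    return best


def unweighted_mean(scores):
    """Plain arithmetic mean of the category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Print the leader in each category.
for category, (model, score) in best_per_category(RESULTS).items():
    print(f"{category}: {model} ({score:.2f})")

# Compare the reported average with the unweighted mean of the five categories.
for model, (scores, reported_avg) in RESULTS.items():
    print(f"{model}: reported {reported_avg:.2f}, unweighted {unweighted_mean(scores):.2f}")
```

On these numbers, GPT-4o-Audio leads every category except Solution Request, where Qwen-2.5-omni scores highest.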


Provider:

maas

Created:

2025-09-26
