WildSpeech-Bench
Source: ModelScope community. Updated 2025-12-05; indexed 2025-10-04.
Download link:
https://modelscope.cn/datasets/tencent-community/WildSpeech-Bench
Resource description:
<h2 align="center" style="font-size: 2.5em; font-weight: bold; color: #2c3e50;">
WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
</h2>
<p align="center">
<a href="https://huggingface.co/datasets/tencent/WildSpeech-Bench" style="margin: 0 10px;">🤗 Dataset</a> |
<a href="https://github.com/Tencent/WildSpeech-Bench" style="margin: 0 10px;">🐙 GitHub</a> |
<a href="https://arxiv.org/abs/2506.21875" style="margin: 0 10px;">📖 Arxiv</a>
</p>
This repository contains the evaluation code for the paper "[WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild](https://arxiv.org/abs/2506.21875)".
---
## 🔔 Introduction
<p align="center">
<img src="assets/wildspeech.jpg" alt="WildSpeech Overview" style="width: 800px;">
</p>
**WildSpeech-Bench** is the first benchmark for evaluating the **speech-to-speech** (S2S) capabilities of SpeechLLMs. It is distinguished by both its evaluation framework and its construction process.
## 🪝 Construction
<p align="center">
<img src="assets/wildspeech_construction.jpg" alt="WildSpeech Construction" style="width: 800px;">
</p>
Our construction process directly addresses the limitations of existing datasets and yields a curated collection of 1,100 queries organized into five major categories. Each category reflects a common user intent, which enables fine-grained analysis and ensures broad coverage of real-world demands on SpeechLLMs. Construction involves filtering for queries characteristic of spoken interaction, followed by a manual audit in which **every selected query was validated by human experts** to ensure its quality and relevance.

Our evaluation framework significantly improves the precision of LLM-based judging for S2S interactions. Rather than relying only on generic rubrics, which often overlook critical nuances, we use query-specific evaluation prompts for challenging queries. These are not generic templates but **meticulously hand-crafted checklists**, each manually authored and refined by our team to highlight a specific query's characteristics and potential pitfalls.
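The idea of attaching a hand-written checklist to a judge prompt can be sketched as follows. This is a minimal illustration, not the benchmark's actual prompt code: the function name `build_judge_prompt`, the rubric wording, the example query, and the checklist items are all hypothetical, and the sketch assumes the judge LLM sees a text transcript of the spoken response. The real prompts are in the linked GitHub repository.

```python
from typing import Optional, Sequence

# Hypothetical generic rubric; the wording is invented for illustration only.
GENERIC_RUBRIC = (
    "Rate how well the response addresses the user's spoken query, "
    "considering helpfulness and accuracy."
)


def build_judge_prompt(
    query: str,
    response_transcript: str,
    checklist: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a judge prompt, adding a query-specific checklist if one exists."""
    parts = [
        GENERIC_RUBRIC,
        f"User query: {query}",
        # Assumption: the judge receives a transcript of the spoken response.
        f"Model response (transcribed): {response_transcript}",
    ]
    if checklist:
        parts.append("Query-specific checklist:")
        parts.extend(f"- {item}" for item in checklist)
    return "\n".join(parts)


# Invented example of a challenging query with its own checklist.
example_prompt = build_judge_prompt(
    query="How do I spell 'their', as in belonging to them?",
    response_transcript="It is spelled t-h-e-i-r.",
    checklist=[
        "Spells the possessive form t-h-e-i-r.",
        "Does not confuse it with 'there' or 'they're'.",
    ],
)
print(example_prompt)
```

Queries without a checklist fall back to the generic rubric alone, which mirrors the description above: only challenging queries receive a dedicated prompt.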
## 🏆 Main Result
Main evaluation results. TC, II, SR, OE, and PF stand for Text Creation, Information Inquiry, Solution Request, Opinion Exchange, and Paralinguistic-Featured queries, respectively.
| Model | TC | II | SR | OE | PF | Avg. |
|----------------------|------|------|------|------|------------------------|------|
| Naive Pipeline | 5.55 | 4.98 | 5.51 | 5.18 | 4.84 | 5.24 |
| Kimi-Audio | 4.45 | 4.33 | 4.79 | 4.70 | 4.92 | 4.54 |
| GLM-4-Voice | 5.16 | 4.77 | 5.41 | 5.04 | 4.51 | 5.03 |
| MiniCPM | 5.17 | 4.89 | 5.28 | 5.31 | 4.78 | 5.08 |
| Qwen-2.5-omni | 5.98 | 5.84 | 6.66 | 6.16 | 4.46 | 6.01 |
| GPT-4o-Audio | 6.74 | 6.06 | 6.39 | 6.32 | 6.01 | 6.29 |
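A short sketch for working with the table: it ranks models by the reported Avg., finds the best model per category, and compares the reported Avg. with the unweighted mean of the five category scores. The numbers are copied from the table; the variable and function names are illustrative. The two averages differ for most rows (for Qwen-2.5-omni, 5.82 unweighted against 6.01 reported), which would be consistent with averaging over all queries when categories have unequal sizes. That explanation is an assumption; the text here does not state how Avg. is computed.

```python
CATEGORIES = ("TC", "II", "SR", "OE", "PF")

# Scores copied from the table: five category scores, then the reported Avg.
RESULTS = {
    "Naive Pipeline": (5.55, 4.98, 5.51, 5.18, 4.84, 5.24),
    "Kimi-Audio": (4.45, 4.33, 4.79, 4.70, 4.92, 4.54),
    "GLM-4-Voice": (5.16, 4.77, 5.41, 5.04, 4.51, 5.03),
    "MiniCPM": (5.17, 4.89, 5.28, 5.31, 4.78, 5.08),
    "Qwen-2.5-omni": (5.98, 5.84, 6.66, 6.16, 4.46, 6.01),
    "GPT-4o-Audio": (6.74, 6.06, 6.39, 6.32, 6.01, 6.29),
}


def unweighted_mean(scores):
    """Plain arithmetic mean of the per-category scores."""
    return sum(scores) / len(scores)


# Models ordered from highest to lowest reported Avg.
ranking = sorted(RESULTS, key=lambda m: RESULTS[m][-1], reverse=True)

# Best-scoring model for each category column.
best_per_category = {
    cat: max(RESULTS, key=lambda m: RESULTS[m][i])
    for i, cat in enumerate(CATEGORIES)
}

for model in ranking:
    *per_category, reported = RESULTS[model]
    print(
        f"{model:<15} reported={reported:.2f} "
        f"unweighted={unweighted_mean(per_category):.2f}"
    )
print(best_per_category)
```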
## 🔦 Citation
```bibtex
@misc{zhang2025wildspeechbenchbenchmarkingendtoendspeechllms,
title={WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild},
author={Linhao Zhang and Jian Zhang and Bokai Lei and Chuhan Wu and Aiwei Liu and Wei Jia and Xiao Zhou},
year={2025},
eprint={2506.21875},
archivePrefix={arXiv},
primaryClass={cs.CL},
}
```
## 📜 License
See the [License.txt](./License.txt) file for details.
Provider: maas
Created: 2025-09-26



