WildSpeech-Bench

ModelScope community · updated 2025-12-05 · indexed 2025-10-04
Download link: https://modelscope.cn/datasets/tencent-community/WildSpeech-Bench
Description:
<h2 align="center" style="font-size: 2.5em; font-weight: bold; color: #2c3e50;">
WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
</h2>

<p align="center">
<a href="https://huggingface.co/datasets/tencent/WildSpeech-Bench" style="margin: 0 10px;">🤗 Dataset</a>
|
<a href="https://github.com/Tencent/WildSpeech-Bench" style="margin: 0 10px;">🐙 GitHub</a>
|
<a href="https://arxiv.org/abs/2506.21875" style="margin: 0 10px;">📖 Arxiv</a>
</p>

This repository contains the evaluation code for the paper "[WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild](https://arxiv.org/abs/2506.21875)".

---

## 🔔 Introduction

<p align="center">
<img src="assets/wildspeech.jpg" alt="WildSpeech Overview" style="width: 800px;">
</p>

**WildSpeech-Bench** is the first benchmark for evaluating the **speech-to-speech** (S2S) capabilities of SpeechLLMs, characterized by both its evaluation framework and its construction process.

## 🪝 Construction

<p align="center">
<img src="assets/wildspeech_construction.jpg" alt="WildSpeech Overview" style="width: 800px;">
</p>

Our benchmark construction process directly counters the limitations of current datasets, resulting in a curated collection of 1,100 queries organized into five major categories. Each category reflects a common user intent, facilitating granular analysis and ensuring comprehensive coverage of real-world demands on SpeechLLMs. Construction involves not only meticulous filtering for queries characteristic of spoken interaction but also a subsequent manual audit, in which **every selected query was validated by human experts** to ensure its quality and relevance.

Our evaluation framework significantly improves the precision of LLM-based judging for S2S interactions. Moving beyond generic rubrics that often overlook critical nuances, we employ query-specific evaluation prompts for challenging queries. Crucially, these are not generic templates but **meticulously hand-crafted checklists**, each manually authored and fine-tuned by our team to highlight a specific query's characteristics and potential pitfalls.

## 🏆 Main Result

Main evaluation results. TC, II, SR, OE, and PF stand for Text Creation, Information Inquiry, Solution Request, Opinion Exchange, and Paralinguistic-Featured queries, respectively.

| Model          | TC   | II   | SR   | OE   | PF   | Avg. |
|----------------|------|------|------|------|------|------|
| Naive Pipeline | 5.55 | 4.98 | 5.51 | 5.18 | 4.84 | 5.24 |
| Kimi-Audio     | 4.45 | 4.33 | 4.79 | 4.70 | 4.92 | 4.54 |
| GLM-4-Voice    | 5.16 | 4.77 | 5.41 | 5.04 | 4.51 | 5.03 |
| MiniCPM        | 5.17 | 4.89 | 5.28 | 5.31 | 4.78 | 5.08 |
| Qwen-2.5-omni  | 5.98 | 5.84 | 6.66 | 6.16 | 4.46 | 6.01 |
| GPT-4o-Audio   | 6.74 | 6.06 | 6.39 | 6.32 | 6.01 | 6.29 |

## 🔦 Citation

```bibtex
@misc{zhang2025wildspeechbenchbenchmarkingendtoendspeechllms,
  title={WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild},
  author={Linhao Zhang and Jian Zhang and Bokai Lei and Chuhan Wu and Aiwei Liu and Wei Jia and Xiao Zhou},
  year={2025},
  eprint={2506.21875},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}
```

## 📜 License

See the [License.txt](./License.txt) file for details.
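
As a reading aid for the Main Result table, the following is a minimal sketch that finds the top-scoring model in each category and compares the reported Avg. with a plain unweighted mean of the five category scores. The scores are copied from the table; the names `RESULTS`, `best_per_category`, and `unweighted_mean` are hypothetical helpers and not part of the released evaluation code. The source does not state how Avg. is computed. The unweighted mean does not reproduce it, which may reflect weighting by per-category query counts, but that is an inference rather than a documented fact.

```python
# Scores copied from the Main Result table.
# Each entry: model -> ([TC, II, SR, OE, PF], reported Avg.)
CATEGORIES = ["TC", "II", "SR", "OE", "PF"]
RESULTS = {
    "Naive Pipeline": ([5.55, 4.98, 5.51, 5.18, 4.84], 5.24),
    "Kimi-Audio":     ([4.45, 4.33, 4.79, 4.70, 4.92], 4.54),
    "GLM-4-Voice":    ([5.16, 4.77, 5.41, 5.04, 4.51], 5.03),
    "MiniCPM":        ([5.17, 4.89, 5.28, 5.31, 4.78], 5.08),
    "Qwen-2.5-omni":  ([5.98, 5.84, 6.66, 6.16, 4.46], 6.01),
    "GPT-4o-Audio":   ([6.74, 6.06, 6.39, 6.32, 6.01], 6.29),
}


def best_per_category(results):
    """Return {category: (model, score)} for the highest score in each category."""
    best = {}
    for i, cat in enumerate(CATEGORIES):
        model = max(results, key=lambda m: results[m][0][i])
        best[cat] = (model, results[model][0][i])
    return best


def unweighted_mean(scores):
    """Plain arithmetic mean of the category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for cat, (model, score) in best_per_category(RESULTS).items():
        print(f"{cat}: {model} ({score:.2f})")
    for model, (scores, reported) in RESULTS.items():
        print(f"{model}: reported Avg. {reported:.2f}, "
              f"unweighted mean {unweighted_mean(scores):.2f}")
```

Read this way, GPT-4o-Audio leads every category except Solution Request, where Qwen-2.5-omni scores highest, and the Paralinguistic-Featured column shows the widest gap between GPT-4o-Audio and the other systems.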

Provider: maas
Created: 2025-09-26