Download link:

https://modelscope.cn/datasets/FreedomIntelligence/DitingBench

Resource overview:

# Diting Benchmark

[Our paper](https://arxiv.org/abs/2410.13268) [Github](https://github.com/ciwei6107563/Diting-Benchmark/tree/main)

Our benchmark is designed to evaluate the speech comprehension capabilities of Speech LLMs. We tested both humans and Speech LLMs in terms of speech understanding and provided further analysis of the results, along with a comparative study between the two. This offers insights for the future development of Speech LLMs. For more details, please refer to our paper.

## Result

| **Level** | **Task** | **Human Baseline** | **GPT-4o** | **MuLLaMA** | **GAMA** | **SALMONN** | **Qwen2-Audio** |
|-----------|------------------------------|--------------------|------------|-------------|----------|-------------|-----------------|
| **L1** | Language Identification | ✘ | 88.50% | 8.48% | ✘ | 35.17% | 96.44% |
| | Auto-Speech Recognition | 15.49* | 10.24* | ✘ | ✘ | 5.45* | 4.63* |
| | ASR for Legal Terms | 98.50% | 26.47% | ✘ | ✘ | ✘ | 81.04% |
| | ASR for Medical Terms | 97.50% | 41.87% | ✘ | ✘ | ✘ | 53.86% |
| | Auto-Lyrics Transcription | 26.88* | ✘ | ✘ | ✘ | 77.12* | 32.48* |
| | - Hallucination Rate | 3.00% | ✘ | ✘ | ✘ | 29.26% | 38.21% |
| **L2** | Volume Perception | 100.00% | ✘ | 50.00% | 11.98% | 53.22% | 48.96% |
| | Pitch Perception | 96.25% | 29.33% | 33.78% | 41.50% | 50.00% | 50.00% |
| | Binaural Effect Perception | 100.00% | 41.38% | ✘ | ✘ | 49.88% | ✘ |
| | Loudness Assessment | 85.63% | ✘ | 49.77% | ✘ | ✘ | 50.13% |
| | Speech Rate Assessment | 76.25% | ✘ | 50.00% | ✘ | ✘ | 44.93% |
| | Speech Pause Detection | 91.88% | ✘ | 50.00% | 49.97% | ✘ | 51.70% |
| **L3** | Ambient Noise Detection | 91.88% | 45.27% | 50.00% | 60.17% | 49.88% | 50.00% |
| | Acoustic Scene Classification | 90.28% | 16.36% | 5.07% | 12.05% | 20.74% | 27.67% |
| | Speaker’s Age Prediction | 52.59% | 13.43% | 33.60% | ✘ | 36.87% | 38.55% |
| | Speaker’s Gender Recognition | 97.50% | ✘ | 50.00% | ✘ | 48.12% | 79.60% |
| | Speech Emotion Recognition | 50.71% | 16.77% | 9.20% | 3.68% | 10.93% | 79.51% |
| | Cappella Emotion Recognition | 62.25% | 21.50% | 12.42% | 7.08% | 14.62% | 62.38% |
| | Emotion Intensity Perception | 97.50% | 72.67% | 50.00% | 50.00% | 49.29% | 50.00% |
| | Emotion Translation | 3.68 | 0.32 | ✘ | ✘ | 0.27 | 0.31 |
| | Singing Detection | 99.38% | 53.11% | 50.00% | 64.82% | 56.47% | 50.22% |
| **L4** | COVID-19 Risk Detection | 60.63% | ✘ | ✘ | ✘ | 50.00% | 14.17% |
| | Cough Type Classification | 52.50% | 40.33% | 50.16% | 44.17% | 49.17% | 43.39% |
| | Cough Origin Diagnosis | 32.19% | ✘ | ✘ | ✘ | 4.01% | 25.65% |
| | Cough Severity Assessment | 45.42% | 24.12% | 30.85% | 28.50% | 38.24% | 33.86% |
| | Lung Risk Screening | 49.38% | ✘ | 47.62% | ✘ | ✘ | 50.16% |
| **L5** | Spoken English Coach | 1.39 | 0.15 | 1.29 | 0.44 | 0.48 | 0.54 |
| | Voice Detective | 1.20 | ✘ | 0.84 | 0.83 | 0.86 | 1.24 |

**Note**:

- "`✘`" indicates that the model fails to follow the instruction.
- "`*`" denotes that the metric is Word Error Rate (WER) and similar metrics, for which lower values indicate better performance.
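
For readers unfamiliar with the metric behind the `*` rows, here is a minimal sketch of the standard word-level edit-distance definition of WER. It is not the benchmark's evaluation code: the function name `word_error_rate` is hypothetical, and splitting on whitespace without any text normalization is an assumption made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference, so WER-style scores are not bounded the way accuracy percentages are.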


Application scenarios: