
StepEval-Audio-Toolcall

Source: ModelScope community. Updated 2026-01-02; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/stepfun-ai/StepEval-Audio-Toolcall
Resource overview:
# StepEval-Audio-Toolcall

Paper: [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632)

Code: https://github.com/stepfun-ai/Step-Audio2

Project Page: https://www.stepfun.com/docs/en/step-audio2

## Dataset Description

StepEval Audio Toolcall evaluates the invocation performance of four tool types. For each tool, the benchmark contains approximately 200 multi-turn dialogue sets for both positive and negative scenarios:

- Positive samples: The assistant is required to invoke the specified tool in the final turn;
- Negative samples: The assistant must avoid invoking the specified tool.

This structure enables comprehensive evaluation of:

- Trigger metrics: Precision and recall for tool invocation trigger;
- Type accuracy: Accuracy of tool type selection during valid invocations;
- Parameter accuracy: Evaluated separately through LLM-judge labeling of tool parameters.

## Evaluation

The script (LLM_judge.py) evaluates whether the tool parameters are reasonable and computes all the metrics mentioned above.

## Benchmark Results on StepEval-Audio-Toolcall

Comparison between Step-Audio 2 and Qwen3-32B on tool calling. Qwen3-32B is evaluated with text inputs. The date and time tools take no parameters, so parameter accuracy is reported as N/A for them.

<table border="1" cellpadding="5" cellspacing="0" align="center">
  <thead>
    <tr>
      <th style="text-align: center;">Model</th>
      <th style="text-align: center;">Objective</th>
      <th style="text-align: center;">Metric</th>
      <th style="text-align: center;">Audio search</th>
      <th style="text-align: center;">Date & Time</th>
      <th style="text-align: center;">Weather</th>
      <th style="text-align: center;">Web search</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center; vertical-align: middle;" rowspan="3"><strong>Qwen3-32B</strong></td>
      <td align="center"><strong>Trigger</strong></td>
      <td align="center"><strong>Precision / Recall</strong></td>
      <td align="center">67.5 / 98.5</td>
      <td align="center">98.4 / 100.0</td>
      <td align="center">90.1 / 100.0</td>
      <td align="center">86.8 / 98.5</td>
    </tr>
    <tr>
      <td align="center"><strong>Type</strong></td>
      <td align="center"><strong>Accuracy</strong></td>
      <td align="center">100.0</td>
      <td align="center">100.0</td>
      <td align="center">98.5</td>
      <td align="center">98.5</td>
    </tr>
    <tr>
      <td align="center"><strong>Parameter</strong></td>
      <td align="center"><strong>Accuracy</strong></td>
      <td align="center">100.0</td>
      <td align="center">N/A</td>
      <td align="center">100.0</td>
      <td align="center">100.0</td>
    </tr>
    <tr>
      <td style="text-align: center; vertical-align: middle;" rowspan="3"><strong>Step-Audio 2</strong></td>
      <td align="center"><strong>Trigger</strong></td>
      <td align="center"><strong>Precision / Recall</strong></td>
      <td align="center">86.8 / 99.5</td>
      <td align="center">96.9 / 98.4</td>
      <td align="center">92.2 / 100.0</td>
      <td align="center">88.4 / 95.5</td>
    </tr>
    <tr>
      <td align="center"><strong>Type</strong></td>
      <td align="center"><strong>Accuracy</strong></td>
      <td align="center">100.0</td>
      <td align="center">100.0</td>
      <td align="center">90.5</td>
      <td align="center">98.4</td>
    </tr>
    <tr>
      <td align="center"><strong>Parameter</strong></td>
      <td align="center"><strong>Accuracy</strong></td>
      <td align="center">100.0</td>
      <td align="center">N/A</td>
      <td align="center">100.0</td>
      <td align="center">100.0</td>
    </tr>
  </tbody>
</table>

The original data files are available [here](https://huggingface.co/datasets/stepfun-ai/StepEval-Audio-Toolcall/tree/main).

## Citation

```
@misc{wu2025stepaudio2technicalreport,
  title={Step-Audio 2 Technical Report},
  author={Boyong Wu and Chao Yan and Chen Hu and Cheng Yi and Chengli Feng and Fei Tian and Feiyu Shen and Gang Yu and Haoyang Zhang and Jingbei Li and Mingrui Chen and Peng Liu and Wang You and Xiangyu Tony Zhang and Xingyuan Li and Xuerui Yang and Yayue Deng and Yechang Huang and Yuxin Li and Yuxin Zhang and Zhao You and Brian Li and Changyi Wan and Hanpeng Hu and Jiangjie Zhen and Siyu Chen and Song Yuan and Xuelin Zhang and Yimin Jiang and Yu Zhou and Yuxiang Yang and Bingxin Li and Buyun Ma and Changhe Song and Dongqing Pang and Guoqiang Hu and Haiyang Sun and Kang An and Na Wang and Shuli Gao and Wei Ji and Wen Li and Wen Sun and Xuan Wen and Yong Ren and Yuankai Ma and Yufan Lu and Bin Wang and Bo Li and Changxin Miao and Che Liu and Chen Xu and Dapeng Shi and Dingyuan Hu and Donghang Wu and Enle Liu and Guanzhe Huang and Gulin Yan and Han Zhang and Hao Nie and Haonan Jia and Hongyu Zhou and Jianjian Sun and Jiaoren Wu and Jie Wu and Jie Yang and Jin Yang and Junzhe Lin and Kaixiang Li and Lei Yang and Liying Shi and Li Zhou and Longlong Gu and Ming Li and Mingliang Li and Mingxiao Li and Nan Wu and Qi Han and Qinyuan Tan and Shaoliang Pang and Shengjie Fan and Siqi Liu and Tiancheng Cao and Wanying Lu and Wenqing He and Wuxun Xie and Xu Zhao and Xueqi Li and Yanbo Yu and Yang Yang and Yi Liu and Yifan Lu and Yilei Wang and Yuanhao Ding and Yuanwei Liang and Yuanwei Lu and Yuchu Luo and Yuhe Yin and Yumeng Zhan and Yuxiang Zhang and Zidong Yang and Zixin Zhang and Binxing Jiao and Daxin Jiang and Heung-Yeung Shum and Jiansheng Chen and Jing Li and Xiangyu Zhang and Yibo Zhu},
  year={2025},
  eprint={2507.16632},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.16632},
}
```
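
## Example: computing trigger and type metrics

The trigger and type metrics named in the Dataset Description can be illustrated with a minimal sketch. This is not the logic of LLM_judge.py, which is not shown here. The `Sample` record, its field names, and the function `trigger_and_type_metrics` are hypothetical, and the sketch assumes that any tool call counts as a trigger and that type accuracy is measured only over positive samples where a tool was called. Parameter accuracy is left out because it depends on an LLM judge.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One dialogue outcome (hypothetical layout, not the dataset schema)."""
    target_tool: str               # tool the dialogue set was built around
    is_positive: bool              # True if the final turn should call target_tool
    predicted_tool: Optional[str]  # tool the model called, or None if it called nothing


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def trigger_and_type_metrics(samples: List[Sample]) -> Dict[str, float]:
    # Assumption: any tool call counts as a trigger, whatever its type.
    tp = sum(1 for s in samples if s.is_positive and s.predicted_tool is not None)
    fp = sum(1 for s in samples if not s.is_positive and s.predicted_tool is not None)
    fn = sum(1 for s in samples if s.is_positive and s.predicted_tool is None)
    # Assumption: type accuracy is taken over correctly triggered calls only.
    right_type = sum(
        1 for s in samples if s.is_positive and s.predicted_tool == s.target_tool
    )
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "type_accuracy": _ratio(right_type, tp),
    }


# Toy outcomes for a single tool; the values are illustrative, not benchmark data.
samples = [
    Sample("weather", True, "weather"),      # triggered, correct type
    Sample("weather", True, "web_search"),   # triggered, wrong type
    Sample("weather", True, None),           # missed trigger
    Sample("weather", False, "weather"),     # spurious trigger
    Sample("weather", False, None),          # correctly silent
]
metrics = trigger_and_type_metrics(samples)
print(metrics)
```

Keeping the trigger decision separate from the type decision mirrors the way the results table reports them on separate rows: a model can trigger reliably yet pick the wrong tool, and the two failure modes are easier to diagnose when they are counted apart.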

Provider: maas
Created: 2025-07-24