krutrim-ai-labs/VoiceAgentBench

Name: krutrim-ai-labs/VoiceAgentBench
Creator: krutrim-ai-labs
Published: 2026-02-16 05:47:10
License: no description available

Source: Hugging Face. Updated: 2026-02-16. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/krutrim-ai-labs/VoiceAgentBench


Description:

---
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
pretty_name: VoiceAgentBench
language:
- en
- hi
- bn
- ta
- te
- ml
- mr
tags:
- audio
- speech
- tool-calling
- function-calling
- benchmark
---

# VoiceAgentBench

This repository contains the dataset for **VoiceAgentBench**, a large-scale speech benchmark introduced in **“VoiceAgentBench: Are Voice Assistants Ready for Agentic Tasks?”** ([arXiv:2510.07978](https://arxiv.org/abs/2510.07978)).

VoiceAgentBench is designed to evaluate **end-to-end speech-based agents** in realistic, tool-driven settings. Unlike prior speech benchmarks that focus on transcription, intent detection, and speech question answering, this benchmark targets **agentic reasoning from speech input**, requiring models to select appropriate tools, generate structured arguments, orchestrate multi-step workflows, and handle safety-critical requests.

The dataset consists of **multilingual spoken queries** paired with explicit tool/function specifications and expected tool-call outputs, covering **single- and multi-tool usage**, **sequentially dependent and parallel tool orchestration**, **multi-turn spoken dialogues**, and **unsafe user requests requiring correct refusal behavior**.

VoiceAgentBench enables systematic evaluation of both **ASR–LLM pipelines** and **end-to-end SpeechLMs**, highlighting the gap between text-based agents and their speech-based counterparts.

### Repository layout

All benchmark assets live under the top-level `VoiceAgentBench/` directory:

- `VoiceAgentBench/*_data/**.json`: queries / audio paths / instructions / expected tool calls
- `VoiceAgentBench/*_audios/**.wav`: corresponding audio files

In each JSON file, the `path` field is **repo-relative** (e.g. `VoiceAgentBench/single_tool_audios/english/1_audio.wav`).

### Subsets

- **single_tool:** Single tool-call tasks involving simple parameter filling from a spoken query, given a predefined tool.
- **single_tool_retrieval:** Tasks requiring selection of the relevant tool from a tool list, followed by parameter filling based on the spoken query.
- **parallel_tool:** Tasks that require selecting and invoking multiple independent tools in parallel from a provided tool list.
- **seqdep_tool:** Tasks involving chained, sequential tool invocations selected from a tool list.
- **multi_turn:** Dialog-based tool invocation tasks, where a single tool call must be produced based on information accumulated over multiple spoken interaction turns.
- **safety:** Safety evaluation tasks that involve rejecting adversarial or unsafe spoken queries and avoiding unsafe or hallucinated tool invocations.

### Data format (common patterns)

Depending on the subset, each item may include:

- `id`: example id
- `query` / `user_request`: the text query
- `functions`: tool/function specs (or list of tool names in safety)
- `expected_tool_call`: expected tool invocation(s) and arguments (when applicable)
- `path`: relative audio path (wav)
- `duration`: duration in seconds
- `instruction`: system prompt template
- `chat_history`: (multi_turn only) list of turns; user turns include `path` and `duration`

### Using the data

You can read JSON directly, or use `datasets`.
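For the first option, reading the JSON directly, a minimal sketch might look like the following. It assumes the dataset repository has already been downloaded to a local directory, that each subset file holds a top-level JSON list of items, and that `chat_history` turns are dictionaries. The helper names `load_items` and `audio_paths` and the `repo_root` argument are hypothetical and not part of the official codebase.

```python
import json
from pathlib import Path


def load_items(json_file):
    """Load one subset file (assumption: the file is a JSON list of items)."""
    with open(json_file, encoding="utf-8") as f:
        return json.load(f)


def audio_paths(item, repo_root):
    """Resolve the repo-relative audio paths of one item against repo_root.

    Collects the top-level `path` if present, plus any `path` found in
    `chat_history` turns (multi_turn subset; assumed to be dictionaries).
    """
    root = Path(repo_root)
    paths = []
    if "path" in item:
        paths.append(root / item["path"])
    for turn in item.get("chat_history", []):
        if isinstance(turn, dict) and "path" in turn:
            paths.append(root / turn["path"])
    return paths


# Made-up item for illustration; real items would come from load_items(...).
example = {"id": 1, "path": "VoiceAgentBench/single_tool_audios/english/1_audio.wav"}
for p in audio_paths(example, "."):
    print(p)
```

Because the `path` field is repo-relative, joining it with the local checkout directory is enough to locate each audio file; no field other than `path` and `chat_history` is relied on here.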
Example using `datasets` and `huggingface_hub`:

```python
from datasets import load_dataset, Audio
from huggingface_hub import hf_hub_download

# 1) Download JSON
# Note: the layout section lists JSON files under `VoiceAgentBench/*_data/`;
# adjust the `filename` prefix below if this path is not found in the repo.
json_path = hf_hub_download(
    repo_id="krutrim-ai-labs/VoiceAgentBench",
    repo_type="dataset",
    filename="data/single_tool_data/english/single_tool_english.json",  # showing this as an example for single tool calling
)

ds = load_dataset("json", data_files=json_path, split="train")

# 2) Download each audio and replace path
def download_and_replace(example):
    local_path = hf_hub_download(
        repo_id="krutrim-ai-labs/VoiceAgentBench",
        repo_type="dataset",
        filename=example["path"],
    )
    example["path"] = local_path
    return example

ds = ds.map(download_and_replace)

# 3) Cast column so that `path` is decoded as audio
ds = ds.cast_column("path", Audio())

# 4) Test
print(ds[0]["path"])
```

## Code & Evaluation

The official inference and evaluation codebase for **VoiceAgentBench** is available on GitHub:

**GitHub Repository:** [https://github.com/ola-krutrim/VoiceAgentBench](https://github.com/ola-krutrim/VoiceAgentBench)

The repository includes:

* End-to-end inference pipelines for SpeechLMs and ASR–LLM systems
* Structured tool-call parsing and normalization
* LLM-as-a-judge evaluation for:
  * Parameter correctness
  * Multi-tool orchestration
  * Sequential dependencies
  * Multi-turn reasoning
  * Safety & refusal behavior
* Reproducible evaluation scripts across all benchmark subsets
* Modular interface enabling easy integration of new SpeechLMs

### License

This repository is licensed under the Krutrim Community License.

### Citation

If you use this dataset, please cite:

```bibtex
@article{jain2025voiceagentbench,
  title={VoiceAgentBench: Are Voice Assistants ready for agentic tasks?},
  author={Dhruv Jain and Harshit Shukla and Gautam Rajeev and Ashish Kulkarni and Chandra Khatri and Shubham Agarwal},
  year={2025},
  eprint={2510.07978},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.07978},
}
```

Provider:

krutrim-ai-labs
