vnmoorthy/pavo-bench

Name: vnmoorthy/pavo-bench
Creator: vnmoorthy
Published: 2026-04-07 18:27:59
License: CC-BY 4.0

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/vnmoorthy/pavo-bench


Resource description:

---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-generation
- text-to-speech
language:
- en
tags:
- pavo
- benchmark
- asr
- llm
- tts
- pipeline-routing
- voice-assistant
- latency
- quality
- cost
- energy
pretty_name: PAVO-Bench
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: tier1_statistical
    path: tier1_statistical_results.json
  - split: tier1_coupling
    path: tier1_coupling_results.json
  - split: tier1_llm_latency
    path: tier1_llm_latency_results.json
  - split: tier2_e2e
    path: tier2_e2e_results.json
  - split: tier2_cross_dataset
    path: tier2_cross_dataset_results.json
  - split: tier2_noise_robustness
    path: tier2_noise_robustness_results.json
  - split: tier3_50k_summary
    path: tier3_50k_summary.json
  - split: tier3_scaling
    path: tier3_scaling_results.json
  - split: component_ablation
    path: component_ablation_results.json
---

# PAVO-Bench: 50K-Turn Benchmark for ASR-LLM-TTS Pipeline Routing

**Author:** NarasingaMoorthy VeiluKanthaPerumal, University of Pennsylvania

## Description

PAVO-Bench is a benchmark suite for evaluating **ASR-LLM-TTS voice pipeline routing** decisions. It provides 50,000 turns of benchmark data designed to measure how well different pipeline configurations balance **latency**, **quality**, **cost**, and **energy consumption** when routing spoken-language queries through cascaded Automatic Speech Recognition (ASR), Large Language Model (LLM), and Text-to-Speech (TTS) components.

The benchmark is organized into three tiers of increasing scale and complexity, plus component-level ablation studies. All results were produced on GPU hardware.

## Dataset Files

### Tier 1 -- Unit-Level Validation

| File | Description |
|------|-------------|
| `tier1_statistical_results.json` | Statistical reproducibility results across 5 trials of 1,000 turns each (seeds 42, 123, 456, 789, 1024). Reports mean, std, and 95% confidence intervals for PAVO latency, quality, cost, and energy metrics. |
| `tier1_coupling_results.json` | Coupling constraint validation measuring LLM quality degradation as a function of ASR word-error rate (WER 0--20%) using llama3.1:8b. |
| `tier1_llm_latency_results.json` | LLM latency profiling for llama3.1:8b across short (50 token), medium (200 token), and long (500 token) generation contexts. Reports total latency, time-to-first-token, and tokens/second. |

### Tier 2 -- Integration-Level Evaluation

| File | Description |
|------|-------------|
| `tier2_e2e_results.json` | End-to-end pipeline measurements for the cloud_premium (whisper-large-v3 + llama3.1:8b) and edge_fast (whisper-tiny + gemma2:2b) configurations on 200 LibriSpeech samples. Includes per-stage latency breakdowns, sample ASR outputs, and sample LLM responses. |
| `tier2_cross_dataset_results.json` | Cross-dataset ASR evaluation on LibriSpeech and FLEURS for the whisper-large-v3 and whisper-tiny models (200 samples each). Reports WER and latency statistics. |
| `tier2_noise_robustness_results.json` | ASR robustness under white noise at SNR levels 5--30 dB, plus a clean baseline. Reports WER degradation across noise conditions. |

### Tier 3 -- Scale Evaluation

| File | Description |
|------|-------------|
| `tier3_50k_summary.json` | Summary statistics for the full 50,000-turn PAVO-Bench dataset: 40K train / 10K test split, complexity distribution (levels 1--5), generation time, and error rate. |
| `tier3_scaling_results.json` | LLM scaling benchmarks across multiple models (gemma2:2b, llama3.1:8b, etc.) with simple, medium, and complex query types. Reports latency, throughput, and real-time suitability. |

### Component Analysis

| File | Description |
|------|-------------|
| `component_ablation_results.json` | Ablation study comparing PAVO-Full, PAVO-NoCoupling, and other ablated configurations. Reports latency, quality, cost, energy, coupling violations, and infeasible percentages. |

## Usage

### Load individual JSON files directly

```python
import json

from huggingface_hub import hf_hub_download

# Download a specific results file from the dataset repository
path = hf_hub_download(
    repo_id="vnmoorthy/pavo-bench",
    filename="tier3_50k_summary.json",
    repo_type="dataset",
)

# Parse the downloaded JSON and print the headline counts
with open(path) as f:
    data = json.load(f)

print(f"Total samples: {data['total_samples']}")
print(f"Train/Test split: {data['train_samples']}/{data['test_samples']}")
```

### Download all files

```python
from huggingface_hub import snapshot_download

# Mirror every file in the dataset repository into a local directory
snapshot_download(
    repo_id="vnmoorthy/pavo-bench",
    repo_type="dataset",
    local_dir="./pavo-bench-data",
)
```

## Benchmark Metrics

- **Latency** (ms): End-to-end and per-component response time
- **Quality** (0--1): Composite score incorporating ASR accuracy and LLM response quality
- **Cost** (USD): Per-turn inference cost
- **Energy** (mJ): Per-turn energy consumption
- **Coupling violations**: Cases where ASR errors propagate and degrade LLM quality

## Citation

If you use PAVO-Bench in your research, please cite:

```bibtex
@misc{pavo-bench-2026,
  author      = {VeiluKanthaPerumal, NarasingaMoorthy},
  title       = {PAVO-Bench: A 50K-Turn Benchmark for ASR-LLM-TTS Pipeline Routing},
  year        = {2026},
  institution = {University of Pennsylvania},
  url         = {https://huggingface.co/datasets/vnmoorthy/pavo-bench}
}
```

## License

This dataset is released under the [Creative Commons Attribution 4.0 International (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
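
## Supplementary example: summarizing per-seed trials

The Tier 1 file description mentions a mean, standard deviation, and 95% confidence interval over 5 trials. The card does not say how those intervals were computed, so the following is only a minimal sketch of one common approach: a Student's t interval on the five per-seed values. The function name `summarize_trials` and the latency numbers are hypothetical and are not taken from the dataset.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 trials).
T_CRIT_95_DF4 = 2.776


def summarize_trials(values):
    """Return (mean, sample std, CI low, CI high) for exactly five per-trial values.

    Assumption: a t-based interval; the dataset card does not state its method.
    """
    if len(values) != 5:
        raise ValueError("this sketch hard-codes the t critical value for n = 5")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(values))
    return mean, std, mean - half_width, mean + half_width


# Hypothetical per-seed latencies in ms, NOT values from PAVO-Bench.
latencies_ms = [410.0, 420.0, 400.0, 415.0, 405.0]
mean, std, ci_low, ci_high = summarize_trials(latencies_ms)
print(f"mean={mean:.1f} ms, std={std:.2f} ms, 95% CI=[{ci_low:.1f}, {ci_high:.1f}] ms")
```

With only five trials, the t critical value (2.776) is noticeably larger than the normal-approximation value (1.96), so a t interval is the more conservative choice when comparing against the intervals reported in `tier1_statistical_results.json`.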

Provider: vnmoorthy
