bosonai/WildASR

Name: bosonai/WildASR
Creator: bosonai
Published: 2026-04-13 20:10:57
License: Apache 2.0

Source: Hugging Face. Updated 2026-04-13; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/bosonai/WildASR


Description:

---
license: apache-2.0
task_categories:
- automatic-speech-recognition
language:
- en
tags:
- asr
- robustness
- benchmark
- out-of-distribution
- hallucination
- speech
pretty_name: WildASR
size_categories:
- 1K<n<10K
---

# WildASR

Official dataset for **Back to Basics: Revisiting ASR in the Age of Voice Agents**.

Code: [github.com/boson-ai/WildASR-public](https://github.com/boson-ai/WildASR-public)

## Overview

WildASR is a multilingual diagnostic benchmark built from **real human speech** to stress-test ASR robustness under real-world out-of-distribution (OOD) conditions. We decompose robustness into three axes:

- **Environmental Degradation** (the *where*): reverberation, far-field, phone codec, noise gap, clipping
- **Demographic Shift** (the *who*): children, older adults, accented speech
- **Linguistic Diversity** (the *what*): short utterances, incomplete audio, code-switching

## Dataset

Due to licensing constraints, we currently release 7 splits covering environmental degradation (clean, clipping, far-field, noise gap, phone codec, reverberation) and demographic shift (accent). The release contains 10,058 samples, ~30 hours in total.

Each sample contains `audio` (16 kHz WAV), `transcript`, and metadata (`category`, `subset`, `language`, etc.).

More splits and languages will be added as licenses are cleared.
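The one split name that appears on this card, `environment_degradation__en__fleurs_clean_en`, looks like it joins a category, a language code, and a subset with double underscores. The sketch below parses names under that assumption. The `category__language__subset` pattern is inferred from that single example and is not documented here, and `SplitInfo` and `parse_split_name` are hypothetical helper names, not part of the dataset or the toolkit.

```python
from typing import NamedTuple


class SplitInfo(NamedTuple):
    """Components of a split name (hypothetical helper type)."""
    category: str
    language: str
    subset: str


def parse_split_name(name: str) -> SplitInfo:
    """Split a name on '__', assuming the form category__language__subset."""
    parts = name.split("__")
    if len(parts) != 3:
        # The naming pattern is an assumption; fail loudly if it does not hold.
        raise ValueError(f"unexpected split name: {name!r}")
    return SplitInfo(*parts)


info = parse_split_name("environment_degradation__en__fleurs_clean_en")
print(info.category, info.language, info.subset)
```

If the pattern holds for every released split, the same helper could be applied to the keys of the loaded `DatasetDict` to group splits by robustness axis.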
## Usage

```python
from datasets import load_dataset

# Load all splits
ds = load_dataset("bosonai/WildASR")

# Load a specific split
clean = load_dataset("bosonai/WildASR", split="environment_degradation__en__fleurs_clean_en")

# Play audio (in a notebook)
clean[0]["audio"]
```

### Run evaluation with WildASR toolkit

```bash
pip install git+https://github.com/boson-ai/WildASR-public.git
```

```python
import os

from run_eval.eval import create_client, run_asr_evaluation, ASREvalConfig

# Save a split as parquet for the eval toolkit
# (`clean` is the split loaded in the snippet above)
os.makedirs("data", exist_ok=True)
clean.to_parquet("data/fleurs_clean.parquet")

client = create_client("whisper-large-v3", "en")
cfg = ASREvalConfig(
    model_name="whisper-large-v3",
    data_path="data/fleurs_clean.parquet",
    output_dir="results/whisper-large-v3",
    language="en",
    wer_method="qwen",
)
run_asr_evaluation(client=client, config=cfg)
```

## Citation

```bibtex
@misc{wildasr2026,
  title  = {Back to Basics: Revisiting ASR in the Age of Voice Agents},
  author = {Geeyang Tay and Wentao Ma and Jaewon Lee and Yuzhi Tang and Daniel Lee and Weisu Yin and Dongming Shen and Silin Meng and Yi Zhu and Mu Li and Alex Smola},
  year   = {2026},
  note   = {arXiv:2603.25727}
}
```

## License

Apache 2.0
