
bosonai/WildASR

Source: Hugging Face. Updated 2026-04-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/bosonai/WildASR
Description:
---
license: apache-2.0
task_categories:
- automatic-speech-recognition
language:
- en
tags:
- asr
- robustness
- benchmark
- out-of-distribution
- hallucination
- speech
pretty_name: WildASR
size_categories:
- 1K<n<10K
---

# WildASR

Official dataset for **Back to Basics: Revisiting ASR in the Age of Voice Agents**.

Code: [github.com/boson-ai/WildASR-public](https://github.com/boson-ai/WildASR-public)

## Overview

WildASR is a multilingual diagnostic benchmark built from **real human speech** to stress-test ASR robustness under real-world out-of-distribution (OOD) conditions. We decompose robustness into three axes:

- **Environmental Degradation** (the *where*): reverberation, far-field, phone codec, noise gap, clipping
- **Demographic Shift** (the *who*): children, older adults, accented speech
- **Linguistic Diversity** (the *what*): short utterances, incomplete audio, code-switching

## Dataset

Due to licensing constraints, we currently release 7 splits covering environmental degradation (clean, clipping, far-field, noise gap, phone codec, reverberation) and demographic shift (accent): 10,058 samples, ~30 hours total.

Each sample contains `audio` (16 kHz WAV), `transcript`, and metadata (`category`, `subset`, `language`, etc.).

More splits and languages will be added as licenses are cleared.
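As a rough illustration of the sample layout described above, the sketch below totals audio duration per (`category`, `subset`) pair. It assumes each row's `audio` field decodes to a dict with `array` and `sampling_rate` keys, which may depend on the installed `datasets` version. The helper `hours_per_subset` and the demo rows are hypothetical and not part of the WildASR toolkit.

```python
from collections import defaultdict

SAMPLE_RATE = 16_000  # the card states that audio is 16 kHz WAV


def hours_per_subset(samples):
    """Sum audio duration in hours for each (category, subset) pair.

    Assumption: every sample is a dict whose "audio" entry holds the
    decoded waveform under "array" and its rate under "sampling_rate".
    """
    totals = defaultdict(float)
    for sample in samples:
        audio = sample["audio"]
        # Fall back to the documented 16 kHz rate if none is attached.
        rate = audio.get("sampling_rate", SAMPLE_RATE)
        totals[(sample["category"], sample["subset"])] += len(audio["array"]) / rate
    # Convert accumulated seconds to hours.
    return {key: seconds / 3600 for key, seconds in totals.items()}


# Synthetic stand-ins for real rows; all values are made up for illustration.
demo = [
    {"audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000},
     "transcript": "hello world", "category": "environment_degradation",
     "subset": "clean", "language": "en"},
    {"audio": {"array": [0.0] * 16_000, "sampling_rate": 16_000},
     "transcript": "good morning", "category": "environment_degradation",
     "subset": "clean", "language": "en"},
    {"audio": {"array": [0.0] * 8_000, "sampling_rate": 16_000},
     "transcript": "thanks", "category": "demographic_shift",
     "subset": "accent", "language": "en"},
]

for (category, subset), hours in sorted(hours_per_subset(demo).items()):
    print(f"{category}/{subset}: {hours * 3600:.1f} s")
```

The same function should work on rows iterated from a loaded split, provided the audio column decodes to the dict layout assumed here.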
## Usage

```python
from datasets import load_dataset

# Load all splits
ds = load_dataset("bosonai/WildASR")

# Load a specific split
clean = load_dataset("bosonai/WildASR", split="environment_degradation__en__fleurs_clean_en")

# Play audio (in a notebook)
clean[0]["audio"]
```

### Run evaluation with WildASR toolkit

```bash
pip install git+https://github.com/boson-ai/WildASR-public.git
```

```python
import os

from run_eval.eval import create_client, run_asr_evaluation, ASREvalConfig

# Save a split as parquet for the eval toolkit
# (create the output directory first so the write does not fail)
os.makedirs("data", exist_ok=True)
clean.to_parquet("data/fleurs_clean.parquet")

client = create_client("whisper-large-v3", "en")
cfg = ASREvalConfig(
    model_name="whisper-large-v3",
    data_path="data/fleurs_clean.parquet",
    output_dir="results/whisper-large-v3",
    language="en",
    wer_method="qwen",
)
run_asr_evaluation(client=client, config=cfg)
```

## Citation

```bibtex
@misc{wildasr2026,
  title  = {Back to Basics: Revisiting ASR in the Age of Voice Agents},
  author = {Geeyang Tay and Wentao Ma and Jaewon Lee and Yuzhi Tang and Daniel Lee and Weisu Yin and Dongming Shen and Silin Meng and Yi Zhu and Mu Li and Alex Smola},
  year   = {2026},
  note   = {arXiv:2603.25727}
}
```

## License

Apache 2.0
Provider:
bosonai