changelinglab/speechocean-l2eval

Name: changelinglab/speechocean-l2eval
Creator: changelinglab
Published: 2026-01-26 14:35:52
License: no description provided

Source: Hugging Face. Updated 2026-01-26; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/changelinglab/speechocean-l2eval


Dataset description:

---
license: cc-by-nc-4.0
arxiv: 2601.14046
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: utt_id
    dtype: string
  - name: text
    dtype: string
  - name: accuracy
    dtype: int32
  - name: completeness
    dtype: float32
  - name: fluency
    dtype: int32
  - name: prosodic
    dtype: int32
  - name: total
    dtype: int32
  splits:
  - name: train
    num_bytes: 260979874
    num_examples: 2260
  - name: val
    num_bytes: 37136358
    num_examples: 240
  - name: test
    num_bytes: 288161567
    num_examples: 2500
  download_size: 610453123
  dataset_size: 586277799
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
language:
- en
size_categories:
- 1K<n<10K
---

# speechocean762: A non-native English corpus for pronunciation scoring tasks

## Dataset Summary

**speechocean762** is an open-source non-native English speech corpus designed for **pronunciation assessment** and **L2 spoken proficiency modeling**. This Hugging Face version provides **sentence-level audio and expert scores**, organized into `train` / `val` / `test` splits.

All speakers are Mandarin L1 learners of English, spanning both children and adults. Each utterance is evaluated independently by five expert annotators using standardized pronunciation metrics.

This dataset is suitable for:

- pronunciation scoring
- L2 speech assessment
- speech representation learning
- downstream regression or classification tasks

## Dataset Structure

### Splits

The dataset is published with three predefined splits:

- `train` (2,260 examples)
- `val` (240 examples)
- `test` (2,500 examples)

Splits are **speaker-disjoint** and provided as native Hugging Face splits.
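As a minimal sketch of working with the three splits listed above, the snippet below reads the `speaker_id` column of each split through the `datasets` library and provides two checks: split sizes and speaker overlap between splits. The repository ID is taken from the download link. The helper names `load_speaker_ids`, `split_sizes`, and `speaker_overlap` are illustrative assumptions, not part of any toolkit, and the loader assumes the `datasets` package is installed and the Hub is reachable.

```python
from itertools import combinations

REPO_ID = "changelinglab/speechocean-l2eval"
# Split sizes as stated in the dataset card.
EXPECTED_SIZES = {"train": 2260, "val": 240, "test": 2500}


def load_speaker_ids(repo_id=REPO_ID):
    """Return {split_name: list of speaker_id strings} for every split."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)  # DatasetDict keyed by split name
    # Reading only the `speaker_id` column avoids decoding any audio.
    return {name: list(split["speaker_id"]) for name, split in dataset.items()}


def split_sizes(speakers_by_split):
    """Return {split_name: number of utterances}."""
    return {name: len(ids) for name, ids in speakers_by_split.items()}


def speaker_overlap(speakers_by_split):
    """Return {(split_a, split_b): shared speakers} for pairs that share any."""
    overlaps = {}
    for a, b in combinations(sorted(speakers_by_split), 2):
        shared = set(speakers_by_split[a]) & set(speakers_by_split[b])
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

If the card's description holds, `split_sizes(load_speaker_ids())` should match `EXPECTED_SIZES`, and `speaker_overlap(load_speaker_ids())` should return an empty dictionary.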
### Features

Each example contains:

| Field | Type | Description |
|--------|------|-------------|
| `audio` | `Audio` | Speech waveform (16 kHz) |
| `speaker_id` | `string` | Speaker identifier |
| `utt_id` | `string` | Utterance identifier |
| `text` | `string` | Prompt sentence |
| `accuracy` | `int` | Sentence-level pronunciation accuracy |
| `completeness` | `float` | Percentage of correctly pronounced words |
| `fluency` | `int` | Sentence-level fluency score |
| `prosodic` | `int` | Sentence-level prosody score |
| `total` | `int` | Overall pronunciation score |

## Scoring Metrics (Sentence level)

All sentence-level scores follow the original speechocean762 definitions. For detailed descriptions, see:

- **arXiv:** https://arxiv.org/abs/2104.01378
- **GitHub:** https://github.com/jimbozhang/speechocean762

## Dataset Creation

This Hugging Face dataset is derived from the original speechocean762 corpus and includes:

- sentence-level audio
- sentence-level expert scores
- standardized HF Audio features
- speaker-disjoint train/val/test splits

Word-level and phoneme-level annotations are not included in this version.

**Source Dataset**: https://huggingface.co/datasets/mispeech/speechocean762

## License

The original speechocean762 dataset is released for free use, including commercial and non-commercial purposes, as stated by the original authors. Users should consult the original repository for full licensing details.

## Citation

If you use this dataset, please cite the original paper:

```bibtex
@inproceedings{zhang2021speechocean762,
  title={speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment},
  author={Zhang, Junbo and Zhang, Zhiwen and Wang, Yongqing and Yan, Zhiyong and Song, Qiong and Huang, Yukai and Li, Ke and Povey, Daniel and Wang, Yujun},
  booktitle={Proc. Interspeech 2021},
  year={2021}
}
```

## Acknowledgements

All credit for data collection and annotation belongs to the original speechocean762 authors.
This Hugging Face release focuses on standardized access and reproducibility for modern speech and representation learning pipelines.

You can use this dataset with our benchmarking toolkit at https://github.com/changelinglab/prism

```bibtex
@misc{prism2026,
  title={PRiSM: Benchmarking Phone Realization in Speech Models},
  author={Shikhar Bharadwaj and Chin-Jou Li and Yoonjae Kim and Kwanghee Choi and Eunjung Yeo and Ryan Soh-Eun Shim and Hanyu Zhou and Brendon Boldt and Karen Rosero Jacome and Kalvin Chang and Darsh Agrawal and Keer Xu and Chao-Han Huck Yang and Jian Zhu and Shinji Watanabe and David R. Mortensen},
  year={2026},
  eprint={2601.14046},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.14046},
}
```
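Because the card lists regression among the downstream tasks, one plausible way to compare a model's predicted scores against the expert labels is the Pearson correlation coefficient, computed per score field. The sketch below is an illustration only: the card does not prescribe an evaluation metric, and `pearson` and `evaluate_scores` are hypothetical helpers rather than part of the benchmarking toolkit. The field names come from the feature list in the card.

```python
import math

# Score fields named in the dataset card.
SCORE_FIELDS = ("accuracy", "completeness", "fluency", "prosodic", "total")


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def evaluate_scores(references, predictions, fields=SCORE_FIELDS):
    """Correlate predictions with references for each score field.

    Both arguments are lists of dicts keyed by field name, aligned by index
    (for example, rows of the `test` split and a model's outputs for them).
    """
    return {
        field: pearson(
            [ref[field] for ref in references],
            [pred[field] for pred in predictions],
        )
        for field in fields
    }
```

For a single field, a call such as `evaluate_scores(test_rows, model_rows, fields=("total",))` returns a one-entry dictionary, where `test_rows` and `model_rows` stand for whatever aligned records your pipeline produces.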

Provider:

changelinglab
