nyrahealth/disfluency_speech_english

Name: nyrahealth/disfluency_speech_english
Creator: nyrahealth
Published: 2026-04-07 13:41:48
License: apache-2.0

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/nyrahealth/disfluency_speech_english

Description:

---
language:
- en
license: apache-2.0
task_categories:
- automatic-speech-recognition
pretty_name: Nyra Disfluency Speech English
size_categories:
- 1K<n<10K
---

# Nyra Disfluency Speech English

`nyrahealth/disfluency_speech_english` is an English speech dataset for evaluating **verbatim ASR**: models that should transcribe not only the intended words, but also fillers, cutoffs, repetitions, and sound events.

This dataset is based on the [AMAAI Lab DisfluencySpeech dataset](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech) and reformatted for verbatim-transcription benchmarking with paired transcripts:

- `verbatim_transcript`: what the speaker actually said
- `intended_transcript`: a cleaned version of what the speaker meant to say

It is used by the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark), which evaluates verbatim ASR in detail and breaks errors down into fillers, sounds, cutoffs, repetitions, and intended-transcript failures.

For the exact convention definitions used by the benchmark, see:

- [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark)
- [Verbatim transcript conventions](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark?tab=readme-ov-file#verbatim-transcript-conventions)

## Source

This release is derived from:

- Dataset: [amaai-lab/DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
- Paper: [DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage](https://arxiv.org/abs/2406.08820)

The original dataset provides annotated transcripts and several progressively cleaned transcript variants. This Nyra release converts the data into a format that is directly usable for verbatim-ASR evaluation with paired verbatim and intended references.

## Dataset Structure

The dataset contains `4,957` utterances and about `9.4` hours of audio.

Splits:

- `train`: `4,458`
- `validation`: `250`
- `test`: `249`

Features:

- `id`
- `audio`
- `duration_in_s`
- `split`
- `speaker`
- `verbatim_transcript`
- `intended_transcript`
- `timings`
- `verbatim_timings`

## Transcription Conventions

### Verbatim Transcript

The `verbatim_transcript` follows a small set of explicit conventions so disfluencies and non-speech events can be evaluated consistently:

- **Cutoffs** use `*`, for example `th*` or `w*`
- **Fillers** are bracketed tags, primarily `[UH]` and `[UM]`
- **Sound events** are also bracketed tags, for example `[laughter]`, `[breath]`, or `[cough]`
- **Repetitions** are written as repeated words, not as separate tags
- Spoken words are otherwise written as they were said

Example:

```text
I mean we we [UH] should go on th* Thursday [laughter]
```

### Intended Transcript

The `intended_transcript` is the cleaned target for intended ASR. It removes disfluent material, including fillers, sound tags, repeated restarts, and cutoff fragments, while preserving the speaker's meaning.

Example:

```text
verbatim: I mean we we [UH] should go on th* Thursday [laughter]
intended: we should go on Thursday
```

This makes the dataset suitable for evaluating both:

- verbatim transcription quality
- intended transcription quality

## Tag Analysis

The counts below were computed over the full dataset from `verbatim_transcript`.

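Counts of this kind can be approximated with a short script. The following is a minimal sketch, not the procedure used to produce the published numbers. It assumes that a tag is any bracketed token without internal whitespace and that a cutoff is any whitespace-delimited token ending in `*`. The names `TAG_PATTERN`, `CUTOFF_PATTERN`, and `count_events` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tag is a bracketed token such as [UH] or [laughter].
TAG_PATTERN = re.compile(r"\[[^\[\]\s]+\]")
# Assumption: a cutoff is a whitespace-delimited token ending in "*".
CUTOFF_PATTERN = re.compile(r"(?<!\S)[^\s\[\]]+\*(?!\S)")


def count_events(transcripts):
    """Count tags, cutoff tokens, and marked utterances in verbatim transcripts."""
    tags = Counter()
    cutoffs = 0
    marked = 0  # utterances with at least one tag or cutoff
    for text in transcripts:
        found_tags = TAG_PATTERN.findall(text)
        found_cutoffs = CUTOFF_PATTERN.findall(text)
        tags.update(found_tags)
        cutoffs += len(found_cutoffs)
        if found_tags or found_cutoffs:
            marked += 1
    return tags, cutoffs, marked


# Illustrative input only; the first string is the example from the conventions above.
example = [
    "I mean we we [UH] should go on th* Thursday [laughter]",
    "we should go on Thursday",
]
tags, cutoffs, marked = count_events(example)
print(dict(tags), cutoffs, marked)
```

To run this over the real data, pass the `verbatim_transcript` column of each split to `count_events`. Whether the result matches the published counts exactly depends on how closely these two patterns match the conventions actually used.
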
Summary:

- utterances: `4,957`
- utterances with at least one bracketed tag or cutoff: `2,779`
- total bracketed tags: `4,039`
- total cutoff tokens: `582`

### Fillers

| Tag | Count |
| --- | ---: |
| `[UH]` | 2,568 |
| `[UM]` | 504 |

### Sound Tags

| Tag | Count |
| --- | ---: |
| `[laughter]` | 714 |
| `[breath]` | 105 |
| `[lipsmack]` | 59 |
| `[throatclearing]` | 55 |
| `[sigh]` | 18 |
| `[sniff]` | 12 |
| `[cough]` | 4 |

### Cutoffs

| Marker | Count |
| --- | ---: |
| `*` cutoff tokens | 582 |

These statistics are useful when interpreting benchmark results: fillers are common, laughter is the most frequent sound event, and cutoffs occur often enough to matter as a separate evaluation category.

## Benchmark Usage

This dataset is designed to be used with the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark).

That benchmark:

- derives gold disfluency labels automatically from the verbatim and intended transcript pair
- computes transcript metrics such as `vWER` and `iWER`
- computes event metrics for fillers, sounds, cutoffs, and repetitions
- provides detailed error analysis for verbatim ASR models

## Citation

If you use this dataset, please cite the original DisfluencySpeech paper:

```bibtex
@misc{wang2024disfluencyspeechsinglespeakerconversational,
  title={DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage},
  author={Kyra Wang and Dorien Herremans},
  year={2024},
  eprint={2406.08820},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2406.08820}
}
```
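
To make the paired references concrete, here is a minimal sketch of scoring one hypothesis against both transcripts. It assumes that `vWER` and `iWER` are ordinary word error rates measured against `verbatim_transcript` and `intended_transcript` respectively. The benchmark's actual definitions and text normalization are not described here and may differ. The function `word_error_rate` and the sample hypothesis are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# References taken from the example in the transcription conventions.
verbatim_ref = "I mean we we [UH] should go on th* Thursday [laughter]"
intended_ref = "we should go on Thursday"
# Hypothetical ASR output that drops most disfluencies but keeps "I mean".
hypothesis = "I mean we should go on Thursday"

v_wer = word_error_rate(verbatim_ref, hypothesis)  # assumed vWER
i_wer = word_error_rate(intended_ref, hypothesis)  # assumed iWER
print(f"vWER={v_wer:.3f} iWER={i_wer:.3f}")
```

The comparison is case-sensitive and treats tags and cutoff tokens as ordinary words, so a model that omits `[UH]` or `th*` is charged one deletion each against the verbatim reference. The same hypothesis is charged two insertions against the intended reference for keeping "I mean".
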
