nyrahealth/disfluency_speech_english
Source: Hugging Face. Updated: 2026-04-07. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nyrahealth/disfluency_speech_english
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- automatic-speech-recognition
pretty_name: Nyra Disfluency Speech English
size_categories:
- 1K<n<10K
---
# Nyra Disfluency Speech English
`nyrahealth/disfluency_speech_english` is an English speech dataset for evaluating **verbatim ASR**: models that should transcribe not only the intended words, but also fillers, cutoffs, repetitions, and sound events.
This dataset is based on the [AMAAI Lab DisfluencySpeech dataset](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech) and has been reformatted for verbatim-transcription benchmarking. Each utterance carries two paired transcript fields:
- `verbatim_transcript`: what the speaker actually said
- `intended_transcript`: a cleaned version of what the speaker meant to say
It is used by the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark), which evaluates verbatim ASR in detail and breaks errors down into fillers, sounds, cutoffs, repetitions, and intended-transcript failures.
For the exact convention definitions used by the benchmark, see:
- [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark)
- [Verbatim transcript conventions](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark?tab=readme-ov-file#verbatim-transcript-conventions)
## Source
This release is derived from:
- Dataset: [amaai-lab/DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
- Paper: [DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage](https://arxiv.org/abs/2406.08820)
The original dataset provides annotated transcripts and several progressively cleaned transcript variants. This Nyra release converts the data into a format that is directly usable for verbatim-ASR evaluation with paired verbatim and intended references.
## Dataset Structure
The dataset contains `4,957` utterances and about `9.4` hours of audio.
Splits:
- `train`: `4,458`
- `validation`: `250`
- `test`: `249`
Features:
- `id`
- `audio`
- `duration_in_s`
- `split`
- `speaker`
- `verbatim_transcript`
- `intended_transcript`
- `timings`
- `verbatim_timings`
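A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the Hub split names match the ones listed above. The helper names `load_split` and `describe` are hypothetical, and `duration_in_s` is assumed to be numeric:

```python
# Split sizes as stated in this card.
EXPECTED_SPLIT_SIZES = {"train": 4458, "validation": 250, "test": 249}


def load_split(name: str = "test"):
    """Load one split from the Hub (needs `pip install datasets` and network access)."""
    if name not in EXPECTED_SPLIT_SIZES:
        raise ValueError(f"unknown split: {name!r}")
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    # Assumption: the Hub split names match the names listed in this card.
    return load_dataset("nyrahealth/disfluency_speech_english", split=name)


def describe(row: dict) -> str:
    """Format the text fields of one row; field names come from the feature list."""
    return (
        f"{row['id']} ({row['duration_in_s']:.1f}s)\n"
        f"  verbatim: {row['verbatim_transcript']}\n"
        f"  intended: {row['intended_transcript']}"
    )


# Typical use (not executed here because it downloads data):
#   test_set = load_split("test")
#   print(describe(test_set[0]))
```

The expected sizes can be compared against `len(load_split(name))` as a quick sanity check after download.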
## Transcription Conventions
### Verbatim Transcript
The `verbatim_transcript` follows a small set of explicit conventions so disfluencies and non-speech events can be evaluated consistently:
- **Cutoffs** use `*`, for example `th*` or `w*`
- **Fillers** are bracketed tags, primarily `[UH]` and `[UM]`
- **Sound events** are also bracketed tags, for example `[laughter]`, `[breath]`, or `[cough]`
- **Repetitions** are written as repeated words, not as separate tags
- Spoken words are otherwise written as they were said
Example:
```text
I mean we we [UH] should go on th* Thursday [laughter]
```
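The conventions above are simple enough to recognize token by token. The following is a rough sketch, not the benchmark's parser: it assumes tokens are separated by whitespace, treats any bracketed tag other than `[UH]` and `[UM]` as a sound event, and counts a word as a repetition when it equals the previous spoken word (ignoring tags and cutoffs in between). The function name `classify_tokens` is hypothetical.

```python
import re

FILLERS = {"[UH]", "[UM]"}
TAG_RE = re.compile(r"^\[[^\]]+\]$")


def classify_tokens(verbatim: str) -> list[tuple[str, str]]:
    """Label each whitespace-separated token of a verbatim transcript."""
    labels = []
    prev_word = None  # last spoken word seen, lowercased
    for token in verbatim.split():
        if token in FILLERS:
            kind = "filler"
        elif TAG_RE.match(token):
            kind = "sound"
        elif token.endswith("*"):
            kind = "cutoff"
        elif prev_word is not None and token.lower() == prev_word:
            kind = "repetition"
        else:
            kind = "word"
        # Only full words update the repetition context.
        if kind in ("word", "repetition"):
            prev_word = token.lower()
        labels.append((token, kind))
    return labels


for token, kind in classify_tokens(
    "I mean we we [UH] should go on th* Thursday [laughter]"
):
    print(f"{token:12s} {kind}")
```

On the example line, this labels the second `we` as a repetition, `[UH]` as a filler, `th*` as a cutoff, and `[laughter]` as a sound.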
### Intended Transcript
The `intended_transcript` is the cleaned target for intended ASR. It removes disfluent material, including fillers, sound tags, repeated restarts, and cutoff fragments, while preserving the speaker's meaning.
Example:
```text
verbatim: I mean we we [UH] should go on th* Thursday [laughter]
intended: we should go on Thursday
```
This makes the dataset suitable for evaluating both:
- verbatim transcription quality
- intended transcription quality
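Because the two transcripts are paired, the removed material can be recovered by aligning them. The sketch below uses Python's standard `difflib.SequenceMatcher` on whitespace tokens; it illustrates the idea only and is not the benchmark's alignment procedure. The function name `removed_tokens` is hypothetical.

```python
from difflib import SequenceMatcher


def removed_tokens(verbatim: str, intended: str) -> list[tuple[int, str]]:
    """Return (index, token) for verbatim tokens with no match in the intended transcript."""
    v_tokens = verbatim.split()
    i_tokens = intended.split()
    matcher = SequenceMatcher(a=v_tokens, b=i_tokens, autojunk=False)
    removed = []
    for op, v_start, v_end, _i_start, _i_end in matcher.get_opcodes():
        # "delete" and "replace" both mean these verbatim tokens were not kept as-is.
        if op in ("delete", "replace"):
            removed.extend((idx, v_tokens[idx]) for idx in range(v_start, v_end))
    return removed


verbatim = "I mean we we [UH] should go on th* Thursday [laughter]"
intended = "we should go on Thursday"
print(removed_tokens(verbatim, intended))
# [(0, 'I'), (1, 'mean'), (3, 'we'), (4, '[UH]'), (8, 'th*'), (10, '[laughter]')]
```

Note that for a repeated word the alignment decides which copy counts as removed. Here it keeps the first `we` and drops the second, which is one of two equally valid choices.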
## Tag Analysis
The counts below were computed over the full dataset from `verbatim_transcript`.
Summary:
- utterances: `4,957`
- utterances with at least one bracketed tag or cutoff: `2,779`
- total bracketed tags: `4,039`
- total cutoff tokens: `582`
### Fillers
| Tag | Count |
| --- | ---: |
| `[UH]` | 2,568 |
| `[UM]` | 504 |
### Sound Tags
| Tag | Count |
| --- | ---: |
| `[laughter]` | 714 |
| `[breath]` | 105 |
| `[lipsmack]` | 59 |
| `[throatclearing]` | 55 |
| `[sigh]` | 18 |
| `[sniff]` | 12 |
| `[cough]` | 4 |
### Cutoffs
| Marker | Count |
| --- | ---: |
| `*` cutoff tokens | 582 |
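These statistics are useful when interpreting benchmark results. Fillers account for 3,072 of the 4,039 bracketed tags, with `[UH]` alone at 2,568; laughter is by far the most frequent sound event (714 of 967 sound tags); and the 582 cutoff tokens occur often enough to matter as a separate evaluation category.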
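Counts of this kind can be reproduced with a few lines of Python. The sketch below is an assumption about how such a tally might be done, not the script used for this card. The function name `count_markers` is hypothetical, and the second and third sample transcripts are made up for illustration.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"\[[^\]\s]+\]")       # bracketed tags such as [UH] or [laughter]
CUTOFF_RE = re.compile(r"\S+\*(?=\s|$)")   # tokens ending in *, such as th*


def count_markers(transcripts):
    """Count bracketed tags and cutoff tokens over an iterable of verbatim transcripts."""
    tags = Counter()
    cutoffs = 0
    marked_utterances = 0
    for text in transcripts:
        found_tags = TAG_RE.findall(text)
        found_cutoffs = CUTOFF_RE.findall(text)
        tags.update(found_tags)
        cutoffs += len(found_cutoffs)
        if found_tags or found_cutoffs:
            marked_utterances += 1
    return tags, cutoffs, marked_utterances


# Illustrative input only; the last two lines are not taken from the dataset.
sample = [
    "I mean we we [UH] should go on th* Thursday [laughter]",
    "[UM] w* what was that [UH]",
    "that sounds fine",
]
tags, cutoffs, marked = count_markers(sample)
print(tags.most_common(), cutoffs, marked)
```

Applied to every `verbatim_transcript` in the dataset, the same function would yield per-tag counts, the cutoff total, and the number of utterances with at least one marker.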
## Benchmark Usage
This dataset is designed to be used with the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark).
That benchmark:
- derives gold disfluency labels automatically from the verbatim and intended transcript pair
- computes transcript metrics such as `vWER` and `iWER`
- computes event metrics for fillers, sounds, cutoffs, and repetitions
- provides detailed error analysis for verbatim ASR models
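To make the two transcript metrics concrete, here is a plain word error rate computed once against each reference. This is a sketch under the assumption that `vWER` is WER against `verbatim_transcript` and `iWER` is WER against `intended_transcript`; the benchmark's own normalization and scoring may differ, so consult its repository for the actual definitions. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


verbatim = "I mean we we [UH] should go on th* Thursday [laughter]"
intended = "we should go on Thursday"
hypothesis = "we should go on Thursday"  # output of a model that cleans up speech

print(f"vWER-style score: {word_error_rate(verbatim, hypothesis):.3f}")  # 6 / 11
print(f"iWER-style score: {word_error_rate(intended, hypothesis):.3f}")  # 0 / 5
```

A model that silently cleans up speech scores perfectly against the intended reference but misses 6 of the 11 verbatim tokens, which is the gap the paired references are meant to expose.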
## Citation
If you use this dataset, please cite the original DisfluencySpeech paper:
```bibtex
@misc{wang2024disfluencyspeechsinglespeakerconversational,
  title={DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage},
  author={Kyra Wang and Dorien Herremans},
  year={2024},
  eprint={2406.08820},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2406.08820}
}
```
Provider:
nyrahealth



