nyrahealth/disfluency_speech_german
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nyrahealth/disfluency_speech_german
Resource description:
---
language:
- de
license: unknown
task_categories:
- automatic-speech-recognition
pretty_name: Nyra Disfluency Speech German
size_categories:
- n<1K
---
# Nyra Disfluency Speech German
`nyrahealth/disfluency_speech_german` is a German speech dataset for evaluating **verbatim ASR**: models that should transcribe not only the intended words, but also fillers, cutoffs, repetitions, and sound events.
This dataset was recorded **in-house** by two Nyra researchers, **Berns** and **Laurin**, with the goal of producing natural disfluent German speech similar in spirit to the English [AMAAI Lab DisfluencySpeech dataset](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech).
Like the English release, it is formatted for verbatim-transcription benchmarking, with two paired transcripts per utterance:
- `verbatim_transcript`: what the speaker actually said
- `intended_transcript`: a cleaned version of what the speaker meant to say
It is used by the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark), which evaluates verbatim ASR in detail and breaks errors down into fillers, sounds, cutoffs, repetitions, and intended-transcript failures.
For the exact convention definitions used by the benchmark, see:
- [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark)
- [Verbatim transcript conventions](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark?tab=readme-ov-file#verbatim-transcript-conventions)
## Background
This German dataset was designed as a companion to the English verbatim benchmark data.
The English reference point is:
- Dataset: [amaai-lab/DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
- Paper: [DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage](https://arxiv.org/abs/2406.08820)
The German release is **not** part of that original dataset. Instead, it is an in-house Nyra dataset that follows the same general idea: paired verbatim and intended transcripts for detailed evaluation of disfluent speech transcription.
## Dataset Structure
The dataset contains `202` utterances and about `0.95` hours of audio.
Splits:
- `test`: `202`
Features:
- `id`
- `audio`
- `duration_in_s`
- `split`
- `speaker`
- `language`
- `verbatim_transcript`
- `intended_transcript`
- `timings`
- `verbatim_timings`
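
As a minimal sketch of how the structure above could be inspected, the snippet below loads the `test` split with the Hugging Face `datasets` library and totals the `duration_in_s` column. The helper names `load_test_split` and `summarize` are hypothetical and not part of this card; the repository id, split name, and column name are taken from the text above.

```python
from typing import Iterable, Mapping


def load_test_split():
    """Load the single `test` split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so that `summarize` can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("nyrahealth/disfluency_speech_german", split="test")


def summarize(rows: Iterable[Mapping]) -> dict:
    """Count utterances and total audio length using the `duration_in_s` field."""
    utterances = 0
    seconds = 0.0
    for row in rows:
        utterances += 1
        seconds += float(row["duration_in_s"])
    return {"utterances": utterances, "hours": seconds / 3600.0}
```

With network access, `summarize(load_test_split().remove_columns("audio"))` avoids decoding the audio and should give figures comparable to the utterance count and duration stated above.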
## Transcription Conventions
### Verbatim Transcript
The `verbatim_transcript` follows a small set of explicit conventions so disfluencies and non-speech events can be evaluated consistently:
- **Cutoffs** use `*`, for example `w*`, `d*`, or `bru*`
- **Fillers** are bracketed tags, primarily `[UH]` and `[UM]`
- **Sound events** are also bracketed tags, for example `[lipsmack]`, `[throatclearing]`, `[laughter]`, or `[cough]`
- **Repetitions** are written as repeated words, not as separate tags
- Spoken words are otherwise written as they were said
Example:
```text
Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, [UM] ich meine am Wochenende, einen Ausflug machen könnten, weil das w* w* Wetter ganz gut aussieht.
```
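
The conventions above can be turned into a small token classifier. This is a hedged sketch rather than the benchmark's own parser: it assumes tokens are whitespace-separated, that `[UH]` and `[UM]` are the only filler tags, and that every other bracketed tag is a sound event. The names `classify_token` and `label_transcript` are hypothetical.

```python
import re

# A bracketed tag such as [UM] or [lipsmack]; no spaces inside the brackets.
TAG_RE = re.compile(r"\[([^\[\]\s]+)\]")
# Assumption: only the two primary filler tags named in the conventions.
FILLERS = {"UH", "UM"}


def classify_token(token: str) -> str:
    """Return 'filler', 'sound', 'cutoff', or 'word' for one token."""
    core = token.strip(".,;:!?\"")  # ignore punctuation attached to the token
    match = TAG_RE.fullmatch(core)
    if match:
        return "filler" if match.group(1) in FILLERS else "sound"
    if len(core) > 1 and core.endswith("*"):
        return "cutoff"
    return "word"


def label_transcript(text: str):
    """Pair every whitespace-separated token with its class."""
    return [(tok, classify_token(tok)) for tok in text.split()]
```

Repetitions are not detected here, because the conventions write them as ordinary repeated words rather than as tags.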
### Intended Transcript
The `intended_transcript` is the cleaned target for intended ASR. It removes disfluent material (fillers, sound tags, repeated restarts, and cutoff fragments) while preserving the speaker's meaning. As the example below shows, phrasing that the speaker abandons in a self-correction is also dropped.
Example:
```text
verbatim: Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, [UM] ich meine am Wochenende, einen Ausflug machen könnten, weil das w* w* Wetter ganz gut aussieht.
intended: Also, ich denke, dass wir vielleicht am Wochenende einen Ausflug machen könnten, weil das Wetter ganz gut aussieht.
```
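
To illustrate why the intended transcript is a separate annotation and not just the verbatim text with markup removed, here is a rough sketch that strips bracketed tags and cutoff fragments. The function name `strip_markup` is hypothetical, and this is not how the dataset's intended transcripts were produced.

```python
import re

TAG_RE = re.compile(r"\[[^\[\]\s]+\]")  # bracketed filler or sound tag
CUTOFF_RE = re.compile(r"\w+\*")        # word fragment ending in '*'


def strip_markup(verbatim: str) -> str:
    """Remove tags and cutoff fragments, then tidy the spacing."""
    text = TAG_RE.sub(" ", verbatim)
    text = CUTOFF_RE.sub(" ", text)
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # re-attach stray punctuation
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the verbatim example above, this still keeps `nächste Woche, ich meine`, which the reference intended transcript drops. Self-corrections and repetitions need more than markup stripping.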
This makes the dataset suitable for evaluating both:
- verbatim transcription quality
- intended transcription quality
## Tag Analysis
The counts below were computed over the full dataset from `verbatim_transcript`.
Summary:
- utterances: `202`
- utterances with at least one bracketed tag or cutoff: `202`
- total bracketed tags: `846`
- total cutoff tokens: `394`
### Fillers
| Tag | Count |
| --- | ---: |
| `[UH]` | 348 |
| `[UM]` | 266 |
### Sound Tags
| Tag | Count |
| --- | ---: |
| `[throatclearing]` | 57 |
| `[laughter]` | 57 |
| `[lipsmack]` | 50 |
| `[cough]` | 21 |
| `[sniff]` | 16 |
| `[breath]` | 13 |
| `[yawn]` | 11 |
| `[sigh]` | 5 |
| `[noise]` | 2 |
### Cutoffs
| Marker | Count |
| --- | ---: |
| `*` cutoff tokens | 394 |
These statistics show that the German set is deliberately dense in disfluencies: every utterance contains at least one annotated event or cutoff, fillers are very frequent, and cutoff fragments occur often enough to be a core evaluation category.
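
Counts of this kind can be approximated with a short script. The sketch below is not the script that produced the published numbers; it assumes tags match `[...]` without inner spaces and that a cutoff is any word fragment ending in `*`, so results on the real data may differ slightly. The name `count_events` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

TAG_RE = re.compile(r"\[[^\[\]\s]+\]")  # bracketed filler or sound tag
CUTOFF_RE = re.compile(r"\w+\*")        # word fragment ending in '*'


def count_events(transcripts: Iterable[str]) -> dict:
    """Tally tags and cutoffs over a collection of verbatim transcripts."""
    tags = Counter()
    cutoffs = 0
    utterances = 0
    with_event = 0
    for text in transcripts:
        utterances += 1
        found_tags = TAG_RE.findall(text)
        found_cutoffs = CUTOFF_RE.findall(text)
        tags.update(found_tags)
        cutoffs += len(found_cutoffs)
        if found_tags or found_cutoffs:
            with_event += 1
    return {
        "utterances": utterances,
        "utterances_with_event": with_event,
        "total_tags": sum(tags.values()),
        "total_cutoffs": cutoffs,
        "per_tag": dict(tags),
    }
```

Passing the `verbatim_transcript` column of the `test` split to `count_events` would give a summary in the same shape as the tables above.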
## Benchmark Usage
This dataset is designed to be used with the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark).
That benchmark:
- derives gold disfluency labels automatically from the verbatim and intended transcript pair
- computes transcript metrics such as `vWER` and `iWER`
- computes event metrics for fillers, sounds, cutoffs, and repetitions
- provides detailed error analysis for verbatim ASR models
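
For orientation, the following is a generic word error rate computed with word-level edit distance. It is only a sketch: the benchmark's own normalization and scoring may differ, and it assumes that `vWER` and `iWER` denote WER against the verbatim and intended references respectively, which the names suggest but this card does not spell out. The functions `tokenize` and `word_error_rate` are hypothetical names.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, drop surrounding punctuation (assumed normalization)."""
    tokens = (tok.strip(".,;:!?\"").lower() for tok in text.split())
    return [t for t in tokens if t]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Under these assumptions, a model output would be scored once against `verbatim_transcript` and once against `intended_transcript`. A model that silently cleans up disfluencies would then do well on the second score and poorly on the first.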
## Citation
If you use this dataset, please cite the benchmark repository and state that the German recordings were collected in-house by Nyra researchers Berns and Laurin as a German companion set for verbatim-ASR evaluation.
Provider:
nyrahealth



