
nyrahealth/disfluency_speech_german

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nyrahealth/disfluency_speech_german
Resource description:
---
language:
- de
license: unknown
task_categories:
- automatic-speech-recognition
pretty_name: Nyra Disfluency Speech German
size_categories:
- n<1K
---

# Nyra Disfluency Speech German

`nyrahealth/disfluency_speech_german` is a German speech dataset for evaluating **verbatim ASR**: models that should transcribe not only the intended words, but also fillers, cutoffs, repetitions, and sound events.

This dataset was recorded **in-house** by two Nyra researchers, **Berns** and **Laurin**, with the goal of producing natural disfluent German speech similar in spirit to the English [AMAAI Lab DisfluencySpeech dataset](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech). Like the English release, it is formatted for verbatim-transcription benchmarking with paired:

- `verbatim_transcript`: what the speaker actually said
- `intended_transcript`: a cleaned version of what the speaker meant to say

It is used by the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark), which evaluates verbatim ASR in detail and breaks errors down into fillers, sounds, cutoffs, repetitions, and intended-transcript failures.

For the exact convention definitions used by the benchmark, see:

- [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark)
- [Verbatim transcript conventions](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark?tab=readme-ov-file#verbatim-transcript-conventions)

## Background

This German dataset was designed as a companion to the English verbatim benchmark data. The English reference point is:

- Dataset: [amaai-lab/DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
- Paper: [DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage](https://arxiv.org/abs/2406.08820)

The German release is **not** part of that original dataset.
Instead, it is an in-house Nyra dataset that follows the same general idea: paired verbatim and intended transcripts for detailed evaluation of disfluent speech transcription.

## Dataset Structure

The dataset contains `202` utterances and about `0.95` hours of audio.

Splits:

- `test`: `202`

Features:

- `id`
- `audio`
- `duration_in_s`
- `split`
- `speaker`
- `language`
- `verbatim_transcript`
- `intended_transcript`
- `timings`
- `verbatim_timings`

## Transcription Conventions

### Verbatim Transcript

The `verbatim_transcript` follows a small set of explicit conventions so disfluencies and non-speech events can be evaluated consistently:

- **Cutoffs** use `*`, for example `w*`, `d*`, or `bru*`
- **Fillers** are bracketed tags, primarily `[UH]` and `[UM]`
- **Sound events** are also bracketed tags, for example `[lipsmack]`, `[throatclearing]`, `[laughter]`, or `[cough]`
- **Repetitions** are written as repeated words, not as separate tags
- Spoken words are otherwise written as they were said

Example:

```text
Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, [UM] ich meine am Wochenende, einen Ausflug machen könnten, weil das w* w* Wetter ganz gut aussieht.
```

### Intended Transcript

The `intended_transcript` is the cleaned target for intended ASR. It removes disfluent material, including fillers, sound tags, repeated restarts, and cutoff fragments, while preserving the speaker's meaning.

Example:

```text
verbatim: Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, [UM] ich meine am Wochenende, einen Ausflug machen könnten, weil das w* w* Wetter ganz gut aussieht.
intended: Also, ich denke, dass wir vielleicht am Wochenende einen Ausflug machen könnten, weil das Wetter ganz gut aussieht.
```

This makes the dataset suitable for evaluating both:

- verbatim transcription quality
- intended transcription quality

## Tag Analysis

The counts below were computed over the full dataset from `verbatim_transcript`.
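
As a rough illustration of how such counts could be tallied, here is a minimal sketch that assumes only the conventions described above: bracketed tags such as `[UM]` or `[lipsmack]`, and cutoff tokens ending in `*`. The regular expressions and the `count_events` helper are hypothetical and are not taken from the benchmark code, which may tokenize differently.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the conventions described in this card.
TAG_RE = re.compile(r"\[([A-Za-z]+)\]")   # bracketed tags, e.g. [UM], [cough]
CUTOFF_RE = re.compile(r"\b\w+\*")        # cutoff tokens, e.g. w*, bru*
FILLER_TAGS = {"UH", "UM"}                # fillers named in this card


def count_events(transcripts):
    """Tally filler tags, sound tags, and cutoff tokens over transcripts."""
    fillers, sounds = Counter(), Counter()
    cutoffs = 0
    for text in transcripts:
        for tag in TAG_RE.findall(text):
            # Any bracketed tag that is not a known filler counts as a sound.
            (fillers if tag in FILLER_TAGS else sounds)[tag] += 1
        cutoffs += len(CUTOFF_RE.findall(text))
    return fillers, sounds, cutoffs


# The verbatim example from this card.
example = (
    "Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, "
    "[UM] ich meine am Wochenende, einen Ausflug machen könnten, "
    "weil das w* w* Wetter ganz gut aussieht."
)
fillers, sounds, cutoffs = count_events([example])
print(dict(fillers), dict(sounds), cutoffs)
# → {'UM': 2, 'UH': 1} {'lipsmack': 1} 2
```

Passing every `verbatim_transcript` in the `test` split to `count_events` should produce totals comparable to the ones reported here, provided the assumed patterns match the annotation scheme.
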

Summary:

- utterances: `202`
- utterances with at least one bracketed tag or cutoff: `202`
- total bracketed tags: `846`
- total cutoff tokens: `394`

### Fillers

| Tag | Count |
| --- | ---: |
| `[UH]` | 348 |
| `[UM]` | 266 |

### Sound Tags

| Tag | Count |
| --- | ---: |
| `[throatclearing]` | 57 |
| `[laughter]` | 57 |
| `[lipsmack]` | 50 |
| `[cough]` | 21 |
| `[sniff]` | 16 |
| `[breath]` | 13 |
| `[yawn]` | 11 |
| `[sigh]` | 5 |
| `[noise]` | 2 |

### Cutoffs

| Marker | Count |
| --- | ---: |
| `*` cutoff tokens | 394 |

These statistics show that the German set is deliberately dense in disfluencies: every utterance contains at least one annotated event or cutoff, fillers are very frequent, and cutoff fragments occur often enough to be a core evaluation category.

## Benchmark Usage

This dataset is designed to be used with the [Nyra Verbatim Speech Benchmark](https://github.com/nyrahealth/nyra_verbatim_speech_benchmark).

That benchmark:

- derives gold disfluency labels automatically from the verbatim and intended transcript pair
- computes transcript metrics such as `vWER` and `iWER`
- computes event metrics for fillers, sounds, cutoffs, and repetitions
- provides detailed error analysis for verbatim ASR models

## Citation

If you use this dataset, please cite the benchmark repository and state that the German recordings were collected in-house by Nyra researchers Berns and Laurin as a German companion set for verbatim-ASR evaluation.
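
## Illustrative WER Sketch

The `vWER` and `iWER` metrics listed under Benchmark Usage are not defined in this card. The sketch below assumes they are ordinary word error rates computed against `verbatim_transcript` and `intended_transcript` respectively. The `normalize` and `word_error_rate` helpers, and the choice to lowercase and strip punctuation, are assumptions for illustration; the benchmark repository remains the reference for the actual definitions.

```python
import re


def normalize(text):
    """Lowercase, drop basic punctuation, and split on whitespace.

    Bracketed tags and '*' cutoff markers are kept as part of their tokens.
    This normalization is an assumption, not the benchmark's own.
    """
    return re.sub(r"[.,!?;:]", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# The transcript pair shown in this card.
verbatim = (
    "Also, [UM] ich denke, dass [lipsmack] wir vielleicht [UH] nächste Woche, "
    "[UM] ich meine am Wochenende, einen Ausflug machen könnten, "
    "weil das w* w* Wetter ganz gut aussieht."
)
intended = (
    "Also, ich denke, dass wir vielleicht am Wochenende einen Ausflug "
    "machen könnten, weil das Wetter ganz gut aussieht."
)

# A hypothetical model output that transcribes only the cleaned speech.
hypothesis = intended
print("WER vs verbatim:", word_error_rate(verbatim, hypothesis))
print("WER vs intended:", word_error_rate(intended, hypothesis))
```

Under these assumptions, a model that emits only clean text scores perfectly against the intended transcript but is penalized against the verbatim one for every omitted filler, sound tag, restart, and cutoff. That gap is the one the paired transcripts are meant to expose.
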
Provider: nyrahealth