aman4014/translated-german-english-asr

Source: Hugging Face · Updated 2026-05-06 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/aman4014/translated-german-english-asr
Description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  - name: translation
    dtype: string
  splits:
  - name: train_el_tts
    num_bytes: 702196490.0
    num_examples: 495
  - name: train_mls_0
    num_bytes: 191221465088.0
    num_examples: 406042
  - name: train_tuda_0
    num_bytes: 13599196098.12
    num_examples: 42812
  - name: train_cv19_0
    num_bytes: 245838579828.326
    num_examples: 546561
  - name: train_emilia_yodas0
    num_bytes: 955252797694.16
    num_examples: 1980468
  - name: train_eurospeech
    num_bytes: 241659687916.69
    num_examples: 502515
  - name: train_de_DE_kerstin
    num_bytes: 139989365.876
    num_examples: 1374
  - name: TV_2021.02_Neutral
    num_bytes: 7254720000.0
    num_examples: 22671
  - name: TV_2021.06_Emotional
    num_bytes: 646400000.0
    num_examples: 2020
  - name: TV_2022.10_Neutral
    num_bytes: 3984320000.0
    num_examples: 12451
  - name: TV_2023.09_Hessisch
    num_bytes: 673920000.0
    num_examples: 2106
  - name: train_mozilla_english_asr
    num_bytes: 46096297644
    num_examples: 1147812
  download_size: 1707069570125.172
  dataset_size: 1707069570125.172
configs:
- config_name: default
  data_files:
  - split: train_el_tts
    path: data/train_el_tts-*
  - split: train_mls_0
    path: data/train_mls_0-*
  - split: train_tuda_0
    path: data/train_tuda_0-*
  - split: train_cv19_0
    path: data/train_cv19_0-*
  - split: train_emilia_yodas0
    path: data/train_emilia_yodas0-*
  - split: train_eurospeech
    path: data/train_eurospeech-*
  - split: train_de_DE_kerstin
    path: data/train_de_DE_kerstin-*
  - split: TV_2021.02_Neutral
    path: data/TV_2021.02_Neutral-*
  - split: TV_2021.06_Emotional
    path: data/TV_2021.06_Emotional-*
  - split: TV_2022.10_Neutral
    path: data/TV_2022.10_Neutral-*
  - split: TV_2023.09_Hessisch
    path: data/TV_2023.09_Hessisch-*
  - split: train_mozilla_english_asr
    path: data/train_mozilla_english_asr-*
language:
- de
- en
task_categories:
- automatic-speech-recognition
- translation
tags:
- german
- speech
- asr
- tts
- translation
- multilingual
pretty_name: Translated German-English ASR Dataset
---

# Translated German-English ASR Dataset

A large-scale, multi-source German speech dataset with paired English translations, designed for training and evaluating German **Automatic Speech Recognition (ASR)**, **Speech Translation**, and **Text-to-Speech (TTS)** systems.

This dataset is a curated mixture of well-established open-source German and multilingual speech corpora, unified under a common schema of German audio, original German transcriptions, and English translations.

<table style="border-collapse: collapse; border: none;">
<tr style="border: none;">
<td style="border: none; padding: 0 20px;">
<a href="https://hpi.de/ki-servicezentrum/">
<img src="https://docs.sc.hpi.de/attachments/aisc/aisc-logo.png" alt="KI-Servicezentrum Berlin-Brandenburg" style="height: 60px; width: auto;">
</a>
</td>
<td style="border: none; padding: 0 20px;">
<a href="https://www.bmftr.bund.de">
<img src="https://docs.sc.hpi.de/attachments/aisc/bmftr.jpg" alt="Gefoerdert durch BMFTR" style="height: 60px; width: auto;">
</a>
</td>
</tr>
</table>

---

## Dataset Summary

| Property | Value |
|---|---|
| **Primary Language** | German (de) |
| **Translation Language** | English (en) |
| **Total Examples** | 4,667,327 |
| **Total Size (disk)** | ~1.71 TB (download) / ~1.71 TB (uncompressed) |
| **Audio Format** | Variable (WAV/FLAC/MP3, 16 kHz to 44.1 kHz) |
| **Tasks** | ASR, Speech Translation, TTS |

---

## Dataset Structure

### Features

Each example contains the following fields:

| Field | Type | Description |
|---|---|---|
| `audio` | `Audio` | The audio clip, decoded on access; the metadata declares no target sampling rate, so resampling requires casting the column |
| `transcription` | `string` | Original German transcription of the spoken audio |
| `translation` | `string` | English translation of the German transcription |

### Splits

The dataset is organized into 12 splits. Most correspond to a single source corpus; Thorsten-Voice contributes four splits (one per recording session), and `train_mozilla_english_asr` holds English rather than German audio:

| Split | Source Corpus | Examples | Approx. Size | License |
|---|---|---|---|---|
| `train_el_tts` | Custom TTS (Greek-Letters / Eliza-style German TTS) | 495 | ~670 MB | Contact provider |
| `train_mls_0` | [Multilingual LibriSpeech (MLS) - German](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 406,042 | ~182 GB | CC BY 4.0 |
| `train_tuda_0` | [Tuda-De (TU Darmstadt German ASR)](https://huggingface.co/datasets/uhhlt/Tuda-De) | 42,812 | ~13 GB | CC BY 4.0 |
| `train_cv19_0` | [Mozilla Common Voice 19 - German](https://commonvoice.mozilla.org/de/datasets) | 546,561 | ~234 GB | CC0 1.0 |
| `train_emilia_yodas0` | [Emilia-YODAS (German subset)](https://huggingface.co/datasets/amphion/Emilia-Dataset) | 1,980,468 | ~911 GB | CC BY 4.0 |
| `train_eurospeech` | [EuroSpeech - German Parliament](https://huggingface.co/datasets/disco-eth/EuroSpeech) | 502,515 | ~230 GB | Per-parliament (see below) |
| `train_de_DE_kerstin` | [M-AILABS - de_DE_kerstin](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/) | 1,374 | ~134 MB | M-AILABS BSD-3-Clause style |
| `TV_2021.02_Neutral` | [Thorsten-Voice 2021.02 - Neutral](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) | 22,671 | ~6.9 GB | CC0 1.0 |
| `TV_2021.06_Emotional` | [Thorsten-Voice 2021.06 - Emotional](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) | 2,020 | ~617 MB | CC0 1.0 |
| `TV_2022.10_Neutral` | [Thorsten-Voice 2022.10 - Neutral](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) | 12,451 | ~3.8 GB | CC0 1.0 |
| `TV_2023.09_Hessisch` | [Thorsten-Voice 2023.09 - Hessisch Dialect](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) | 2,106 | ~642 MB | CC0 1.0 |
| `train_mozilla_english_asr` | Mozilla Common Voice 25 - English (translated) | 1,147,812 | ~42.9 GB | CC0 1.0 |

**Total: 4,667,327 examples · ~1.71 TB**

---

## Usage

### Load a Specific Split

```python
from datasets import load_dataset

# Load a single split
dataset = load_dataset("aman4014/translated-german-english-asr", split="train_mls_0")
print(dataset[0])
```

### Load All Splits

```python
from datasets import load_dataset

# Loads every split (~1.71 TB in total); prefer streaming for exploration
dataset = load_dataset("aman4014/translated-german-english-asr")
print(dataset.keys())
# dict_keys(['train_el_tts', 'train_mls_0', 'train_tuda_0',
#            'train_cv19_0', 'train_emilia_yodas0', 'train_eurospeech',
#            'train_de_DE_kerstin', 'TV_2021.02_Neutral', 'TV_2021.06_Emotional',
#            'TV_2022.10_Neutral', 'TV_2023.09_Hessisch', 'train_mozilla_english_asr'])
```

### Streaming (Recommended for Large Splits)

```python
from datasets import load_dataset

# Stream large splits to avoid downloading everything at once
dataset = load_dataset(
    "aman4014/translated-german-english-asr",
    split="train_emilia_yodas0",
    streaming=True
)

for example in dataset.take(5):
    print(example["transcription"])
    print(example["translation"])
```

### Access Audio

```python
from datasets import load_dataset

dataset = load_dataset("aman4014/translated-german-english-asr", split="train_cv19_0")

# Audio is decoded on access
sample = dataset[0]
audio_array = sample["audio"]["array"]            # numpy array
sampling_rate = sample["audio"]["sampling_rate"]  # e.g. 16000
transcription = sample["transcription"]           # German text
translation = sample["translation"]               # English text
```

---

## Source Datasets and Descriptions

### Multilingual LibriSpeech (MLS) - `train_mls_0`

MLS is a large-scale multilingual corpus derived from LibriVox audiobooks, covering 8 languages including German. The German subset contains ~1,000 hours of read-speech data from public-domain books. Produced by Facebook AI Research (Meta).

- **Paper:** [MLS: A Large-Scale Multilingual Dataset for Speech Research](https://arxiv.org/abs/2012.03411) (Pratap et al., 2020)
- **Source:** [facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- **License:** CC BY 4.0
- **Style:** Read speech (audiobooks)

### Emilia-YODAS - `train_emilia_yodas0`

Emilia-YODAS is a large-scale multilingual speech dataset processed via the Emilia-Pipe pipeline. This subset is part of the larger Emilia-Large release (~216,000 hours total) and is released under the permissive CC BY 4.0 license.

- **Paper:** [Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset](https://arxiv.org/abs/2501.15907) (He et al., 2025)
- **Source:** [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset)
- **License:** CC BY 4.0
- **Style:** In-the-wild / spontaneous speech

### Tuda-De - `train_tuda_0`

The Tuda-De corpus is a German read-speech dataset recorded at TU Darmstadt using multiple microphones (Kinect, Realtek, Headset). Speakers read sentences from German Wikipedia, the Europarl corpus, and web-crawled text. It is one of the foundational open-source German ASR datasets.

- **Paper:** [Open Source German Distant Speech Recognition: Corpus and Acoustic Model](https://doi.org/10.1007/978-3-319-24033-6_54) (Radeck-Arneth et al., 2015)
- **Source:** [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De)
- **License:** CC BY 4.0
- **Style:** Read speech, multi-microphone, controlled environment

### Common Voice 19 - `train_cv19_0`

Mozilla Common Voice is a massively multilingual, crowd-sourced speech corpus. Volunteers record text prompts and validate each other's recordings, resulting in diverse speech with varied accents, ages, and genders. Version 19 covers 129+ languages.

- **Paper:** [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670) (Ardila et al., 2020)
- **Source:** [mozilla-foundation/common_voice_19_0](https://commonvoice.mozilla.org/de/datasets)
- **License:** CC0 1.0 (Public Domain)
- **Style:** Crowd-sourced read speech, diverse speakers

### EuroSpeech - `train_eurospeech`

EuroSpeech is a large-scale multilingual corpus of parliamentary speech from 22 European nations, aligned using a novel two-stage dynamic algorithm. The German subset is sourced from Bundestag/Bundesrat sessions. Licensing reflects the open-access policies of each national parliament; parliamentary speech in most European jurisdictions is released for public use.

- **Paper:** [EuroSpeech: A Multilingual Speech Corpus](https://arxiv.org/abs/2510.00514) (Pfisterer et al., 2025)
- **Source:** [disco-eth/EuroSpeech](https://huggingface.co/datasets/disco-eth/EuroSpeech)
- **License:** Per-country parliamentary open-access terms (see dataset card for full breakdown)
- **Style:** Parliamentary / formal speech

### M-AILABS - de_DE_kerstin - `train_de_DE_kerstin`

The M-AILABS Speech Dataset is a multi-language TTS/ASR corpus based on LibriVox public domain audiobooks and Project Gutenberg texts. The `de_DE_kerstin` split corresponds to a single German female speaker ("Kerstin") reading audiobook passages.

- **Source:** [M-AILABS Speech Dataset](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/)
- **License:** M-AILABS BSD-3-Clause style license (attribution required, no endorsement)
- **Style:** Read speech (audiobooks), single speaker

### Thorsten-Voice - `TV_2021.02_Neutral`, `TV_2021.06_Emotional`, `TV_2022.10_Neutral`, `TV_2023.09_Hessisch`

Thorsten-Voice is a freely contributed German TTS voice dataset by Thorsten Müller, a single male native German speaker. It encompasses multiple recording sessions covering neutral speech, emotional speech (angry, disgusted, amused, drunk, surprised, sleepy, whisper), an updated neutral session, and a Hessian dialect (Hessisch) session. All recordings are at 44.1 kHz stereo and are released under the completely unrestricted CC0 public domain license.

- **Source:** [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full)
- **License:** CC0 1.0 (Public Domain)
- **Style:** TTS-quality read speech, single male native German speaker

| Subset | Recording Session | Style |
|---|---|---|
| `TV_2021.02_Neutral` | Feb 2021 | Neutral, clear |
| `TV_2021.06_Emotional` | Jun 2021 | Emotional (7 emotion categories) |
| `TV_2022.10_Neutral` | Oct 2022 | Neutral, high quality, LJSpeech-compatible |
| `TV_2023.09_Hessisch` | Sep 2023 | Hessian regional dialect |

### Mozilla Common Voice 25 - English - `train_mozilla_english_asr`

The Mozilla Common Voice English split uses validated English clips with English transcriptions. This dataset uses the English audio and translates the transcription into German to keep the same schema as the German corpora.

- **Source:** [mozilla-foundation/common_voice_25_0](https://commonvoice.mozilla.org/en/datasets)
- **License:** CC0 1.0 (Public Domain)
- **Style:** Crowd-sourced read speech, diverse speakers

---

## Dataset Sources and Licensing

This dataset is a mixture of several German and multilingual speech datasets. For each dataset, the license of the **original author** applies. Please consult the linked sources for detailed licensing information and terms of use.

| Split | Source Dataset | License | Commercial Use | Link |
|---|---|---|---|---|
| `train_el_tts` | Custom TTS | Unknown - contact provider | Unknown | N/A |
| `train_mls_0` | Multilingual LibriSpeech (MLS) | **CC BY 4.0** | Yes | [openslr.org/94](https://www.openslr.org/94/) |
| `train_tuda_0` | Tuda-De | **CC BY 4.0** | Yes | [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De) |
| `train_cv19_0` | Mozilla Common Voice 19 | **CC0 1.0** | Yes | [commonvoice.mozilla.org](https://commonvoice.mozilla.org/de/datasets) |
| `train_emilia_yodas0` | Emilia-YODAS | **CC BY 4.0** | Yes | [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| `train_eurospeech` | EuroSpeech | Per-parliament open access | Verify per country | [disco-eth/EuroSpeech](https://huggingface.co/datasets/disco-eth/EuroSpeech) |
| `train_de_DE_kerstin` | M-AILABS | M-AILABS BSD-3-Clause | Yes (with attribution) | [caito.de](https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/) |
| `TV_2021.02_Neutral` | Thorsten-Voice | **CC0 1.0** | Yes | [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
| `TV_2021.06_Emotional` | Thorsten-Voice | **CC0 1.0** | Yes | [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
| `TV_2022.10_Neutral` | Thorsten-Voice | **CC0 1.0** | Yes | [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
| `TV_2023.09_Hessisch` | Thorsten-Voice | **CC0 1.0** | Yes | [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
| `train_mozilla_english_asr` | Mozilla Common Voice 25 - English (translated) | **CC0 1.0** | Yes | [commonvoice.mozilla.org](https://commonvoice.mozilla.org/en/datasets) |

> **Note:** If a dataset does not have a public source listed, please contact the dataset provider or refer to your data distributor for licensing details.
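
The commercial-use column of the licensing table can be turned into a simple pre-flight check before assembling a training mixture. The following is a minimal sketch, not part of the dataset's own tooling: the dictionary transcribes the table, while the status labels (`"yes"`, `"verify"`, `"unknown"`) and both helper names are hypothetical choices made for illustration.

```python
# Commercial-use status per split, transcribed from the licensing table.
# The labels "yes" / "verify" / "unknown" are conventions of this sketch only.
COMMERCIAL_USE = {
    "train_el_tts": "unknown",           # license unknown, contact provider
    "train_mls_0": "yes",                # CC BY 4.0
    "train_tuda_0": "yes",               # CC BY 4.0
    "train_cv19_0": "yes",               # CC0 1.0
    "train_emilia_yodas0": "yes",        # CC BY 4.0
    "train_eurospeech": "verify",        # per-parliament terms
    "train_de_DE_kerstin": "yes",        # M-AILABS BSD-3-Clause, with attribution
    "TV_2021.02_Neutral": "yes",         # CC0 1.0
    "TV_2021.06_Emotional": "yes",       # CC0 1.0
    "TV_2022.10_Neutral": "yes",         # CC0 1.0
    "TV_2023.09_Hessisch": "yes",        # CC0 1.0
    "train_mozilla_english_asr": "yes",  # CC0 1.0
}


def commercially_usable_splits(status=COMMERCIAL_USE):
    """Return the splits whose table entry is 'Yes' (attribution may still apply)."""
    return [name for name, flag in status.items() if flag == "yes"]


def splits_needing_review(status=COMMERCIAL_USE):
    """Return the splits whose commercial-use status is not a plain 'Yes', sorted by name."""
    return sorted(name for name, flag in status.items() if flag != "yes")


if __name__ == "__main__":
    print("Usable:", commercially_usable_splits())
    print("Review first:", splits_needing_review())
```

Each name returned by `commercially_usable_splits()` can be passed as the `split` argument of `load_dataset`. The check only mirrors the table; it does not replace reading the original license texts.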

### License Summary

- **CC0 1.0 (Public Domain):** No restrictions. Can be used commercially, modified, and redistributed without attribution. Applies to Common Voice and all Thorsten-Voice splits.
- **CC BY 4.0:** Free to use commercially and non-commercially, with attribution. Applies to MLS, Tuda-De, and Emilia-YODAS.
- **M-AILABS BSD-3-Clause:** Similar to BSD-3. Commercial use permitted with attribution; no endorsement of products derived from the data.
- **EuroSpeech / Parliamentary:** Licensed under the open-access terms of each respective national parliament. Generally free for research use; commercial use should be verified per country.

---

## Licensing Disclaimer

The use of this dataset and any derived models must comply with the licenses of the **original underlying datasets**. The most restrictive license in any given use case applies:

- **For fully open / commercial use:** Verify EuroSpeech country-level terms.
- **For non-commercial research:** All splits are generally usable, subject to attribution where required.
- **For the `train_el_tts` split:** License information is unknown. Contact the dataset provider before use.

The dataset creators make no representations or warranties regarding these datasets, including warranties of non-infringement or fitness for a particular purpose. The dataset creators do not claim any rights to the datasets themselves - all rights remain with the original data owners.

**Always verify the license terms before using this data or any models trained on it for commercial or research purposes.**

---

## Acknowledgements

This dataset would not have been possible without the contributions of the following organizations and open-source communities:

* **Mozilla Common Voice**: For providing a massively multilingual, freely available crowd-sourced speech corpus
* **Facebook AI Research (Meta)**: For the Multilingual LibriSpeech dataset
* **Amphion / Emilia Team**: For the Emilia-YODAS large-scale multilingual speech dataset
* **TU Darmstadt**: For the Tuda-De German ASR corpus
* **disco-eth / EuroSpeech Team**: For the multilingual European parliamentary speech corpus
* **Thorsten Müller**: For the freely contributed Thorsten-Voice German TTS dataset
* **M-AILABS**: For the multilingual audiobook-based speech corpus
* **AI Service Center Berlin-Brandenburg (KI-Servicezentrum)**: For supporting this work

<table style="border-collapse: collapse; border: none;">
<tr style="border: none;">
<td style="border: none; padding: 0 20px;">
<a href="https://hpi.de/ki-servicezentrum/">
<img src="https://docs.sc.hpi.de/attachments/aisc/aisc-logo.png" alt="KI-Servicezentrum Berlin-Brandenburg" style="height: 60px; width: auto;">
</a>
</td>
<td style="border: none; padding: 0 20px;">
<a href="https://www.bmftr.bund.de">
<img src="https://docs.sc.hpi.de/attachments/aisc/bmftr.jpg" alt="Gefoerdert durch BMFTR" style="height: 60px; width: auto;">
</a>
</td>
</tr>
</table>

**Funding Notice**

Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums für Forschung, Technologie und Raumfahrt unter dem Förderkennzeichen "KI-Servicezentrum Berlin-Brandenburg" 16IS22092 gefördert. Die Verantwortung für den Inhalt dieser Veröffentlichung liegt beim Autor.

_This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "KI-Servicezentrum Berlin-Brandenburg" 16IS22092.
Responsibility for the content of this publication remains with the author._

---

## Citation

If you use this dataset, please cite the original source datasets as appropriate:

**Multilingual LibriSpeech:**

```bibtex
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
```

**Emilia-YODAS:**

```bibtex
@inproceedings{emilia,
  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  title={Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
  booktitle={Proc. of SLT},
  year={2024}
}
```

**Tuda-De:**

```bibtex
@InProceedings{Radeck-Arneth2015,
  author="Radeck-Arneth, Stephan and Milde, Benjamin and Lange, Arvid and Gouvea, Evandro and Radomski, Stefan and Mühlhäuser, Max and Biemann, Chris",
  title="Open Source German Distant Speech Recognition: Corpus and Acoustic Model",
  booktitle="Text, Speech, and Dialogue",
  year="2015",
  publisher="Springer International Publishing",
  pages="480--488",
  doi="10.1007/978-3-319-24033-6_54"
}
```

**Mozilla Common Voice:**

```bibtex
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```

**EuroSpeech:**

```bibtex
@article{pfisterer2025eurospeech,
  title={EuroSpeech: A Multilingual Speech Corpus},
  author={Samuel Pfisterer and Florian Grötschla and Luca Lanzendörfer and Florian Yan and Roger Wattenhofer},
  year={2025}
}
```

**Thorsten-Voice:**

```bibtex
@misc{thorsten_muller_2024,
  author = {{Thorsten Müller}},
  title = {TV-44kHz-Full},
  year = 2024,
  url = {https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full},
  doi = {10.57967/hf/3290},
  publisher = {Hugging Face}
}
```

**M-AILABS:**

```bibtex
@misc{MAILABS_2017,
  author = {Solak, I. Celeste Aurora and Naumov, Dima},
  title = {The M-AILABS Speech Dataset},
  year = {2017},
  howpublished = {\url{https://github.com/i-celeste-aurora/m-ailabs-dataset}}
}
```
Provider:
aman4014