
LargeScaleASR

ModelScope community · Updated 2025-12-04 · Indexed 2025-02-08
Download link:
https://modelscope.cn/datasets/speechbrain/LargeScaleASR
Description:
# LargeScaleASR: 25,000 hours of transcribed and heterogeneous English speech recognition data for research and commercial use.

The full details [are available in the paper](https://arxiv.org/abs/2505.21578).

The dataset is made of 6 subsets:

1. **large** contains 25,000 hours of read / spontaneous and clean / noisy transcribed speech.
2. **medium** contains 2,500 hours of read / spontaneous and clean / noisy transcribed speech.
3. **small** contains 250 hours of read / spontaneous and clean / noisy transcribed speech.
4. **clean** contains 13,000 hours of read / spontaneous transcribed speech. YODAS and People's Speech data are excluded from this subset because, despite data curation, some errors remain in the transcriptions.
5. **dev** contains 15 hours (more details in the next section).
6. **test** contains 21 hours (more details in the next section).

The large split requires 4TB of storage (including HuggingFace extraction). The shards alone take 2TB.

Example:

```python
from io import BytesIO

import torchaudio
from datasets import load_dataset

# Pick one subset ("small", "medium" or "large") and set num_proc to the number of CPU cores to use.
ds = load_dataset("speechbrain/LoquaciousSet", "small", num_proc=4)
print(ds["train"])

# Audio is stored as raw bytes; torchaudio.load returns (waveform, sample_rate).
wav_tensor, sample_rate = torchaudio.load(BytesIO(ds["train"][0]["wav"]["bytes"]))
```

## Training recipe

A full conformer ASR training recipe is available [here](https://github.com/speechbrain/speechbrain/pull/2806).

## Data description

(The following information is copied directly from the SpeechBrain data preparation README.)

TLS is a mix of 5 existing datasets with permissive licences.
The way it is mixed is described in the following table:

| Dataset | Hours taken (large/medium/small/dev/test) | License |
| ------------- | ------------- | ------------- |
| VoxPopuli | 550/500/50/5/7 | CC0 |
| LibriHeavy | 11,000/500/50/0/0 | CC BY 4.0 |
| Librispeech (dev-/test-other) | 0/0/0/5/7 | CC BY 4.0 |
| YODAS | 6,100/500/50/1.5/1.5 | CC BY 3.0 |
| People's Speech | 5,900/500/50/1.5/1.5 | CC BY 4.0 |
| CommonVoice 18.0 | 1,660/500/50/5/7 | CC0 |

*For the dev and test splits, only data from the corresponding dev and test sets of each dataset is used (i.e. nothing is extracted from the train sets, except for YODAS). For YODAS, we extract data from the en003 split and manually verify the audio and transcriptions to form the dev/test partitions.*

More information on each dataset:

- [**VoxPopuli**](https://arxiv.org/abs/2101.00390): we follow the standard SpeechBrain data preparation.
- [**LibriHeavy**](https://arxiv.org/html/2309.08105v2): samples are randomly selected, but we follow the standard data preparation.
- [**Librispeech**](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf): Librispeech is only used for the validation and test sets of LargeScaleASR. More precisely, we extract samples from *dev-other* and *test-other*, as they are the most challenging subsets.
- [**YODAS**](https://arxiv.org/abs/2406.00899): The YODAS dataset is unfortunately unreliable. The audio is crawled from YouTube, and a lot of it (almost half) is not in the correct language. We used a [SpeechBrain language ID model](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa) to make sure that we only integrate samples where people speak English. Transcriptions have also been heavily normalised (see next section). We arbitrarily decided to use the *en000* and *en001* subsets of YODAS. Transcriptions may be a bit noisy, which is why we manually transcribed the data for the dev and test sets.
- [**People's Speech**](https://huggingface.co/datasets/MLCommons/peoples_speech): Only the *clean* subset of this dataset is used in LargeScaleASR, as even its transcriptions already contain errors.
- [**CommonVoice 18.0**](https://commonvoice.mozilla.org/en): We removed a few speakers that had too many samples (above 9,000) to avoid any bias. Aside from this, we only used samples from the *validated* csv to ensure the best available transcription quality. Text was also heavily normalised (see next section).

### Text and audio normalisation

Some of the above datasets, in particular People's Speech, YODAS and CommonVoice, have very little normalisation. This is an important issue, as the pronunciation is then either incorrect or uncertain. We normalised all the sentences so that the character set contains only the standard 26 letters of the Latin alphabet plus the apostrophe ('). Numerical values were converted to text using the [NeMo text processing WFST tool](https://github.com/NVIDIA/NeMo-text-processing). The rest of the text was filtered to remove symbols, YouTube annotations like "applause" and many other elements. When sentences were too noisy (e.g. too many symbols), we simply removed them. The text normalisation can be found in *speechbrain.utils.text_normalisation*.

Audio is embedded as raw bytes (it can be decoded with soundfile). We chunked long audio files into smaller ones based on the start and stop supervision from the different manifests of the datasets (this is necessary for HuggingFace). Language ID with a [SpeechBrain language ID model](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa) was performed on YODAS.

### Count-based Language Models and Lexicon

The dataset includes three ARPA-format count-based language models trained on the text of the *train.large* subset of Loquacious:

| LM | Size | Binary Size | Dev. Perplexity |
| --- | --- | --- | --- |
| 3-gram pruned | 331MB | 721MB | 222 |
| 4-gram pruned | 538MB | 1.2GB | 202 |
| 4-gram unpruned | 2.4GB | 4.7GB | 193 |

Each language model is limited to a vocabulary containing 216k words.

We also provide a pronunciation lexicon using ARPA-style phonemes, containing one or more pronunciations for each word in the vocabulary. The original pronunciations are based on [CMUDict 0.7b](http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Missing pronunciations were generated using [Sequitur](https://github.com/sequitur-g2p/sequitur-g2p), for which we also provide the trained G2P model.

#### Referencing the Loquacious Set and SpeechBrain

```
@inproceedings{Loquacious,
  title     = {Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use},
  author    = {Titouan Parcollet and Yuan Tseng and Shucong Zhang and Rogier van Dalen},
  year      = {2025},
  booktitle = {Interspeech 2025},
}

@article{speechbrainV1,
  author  = {Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Ha Nguyen and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Ga{{\"e}}lle Laperri{{\`e}}re and Mickael Rouvier and Renato De Mori and Yannick Est{{\`e}}ve},
  title   = {Open-Source Conversational AI with SpeechBrain 1.0},
  journal = {Journal of Machine Learning Research},
  year    = {2024},
  volume  = {25},
  number  = {333},
  pages   = {1--11},
  url     = {http://jmlr.org/papers/v25/24-0991.html}
}
```

#### About SpeechBrain

SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.

Website: https://speechbrain.github.io/

GitHub: https://github.com/speechbrain/speechbrain
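
#### Illustrative sketch of the text normalisation rules

The text normalisation described in the data description (a character set limited to the 26 letters plus the apostrophe, removal of symbols and YouTube-style annotations, and rejection of sentences that are too noisy) could be approximated as below. This is a minimal sketch and not the code in *speechbrain.utils.text_normalisation*: the function name `normalise_sentence`, the bracket-based annotation pattern, the upper-casing and the `max_symbol_ratio` threshold are all assumptions made for illustration. Number verbalisation, which the dataset performs with NeMo, is not reproduced, so sentences containing digits are rejected here.

```python
import re
from typing import Optional

# Assumption: annotations such as "[applause]" or "(laughter)" appear in brackets.
_ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# Anything outside A-Z, apostrophe and space counts as a symbol.
_DISALLOWED = re.compile(r"[^A-Z' ]")


def normalise_sentence(text: str, max_symbol_ratio: float = 0.3) -> Optional[str]:
    """Return a sentence restricted to A-Z, apostrophe and space, or None if rejected."""
    # Drop bracketed annotations, unify curly apostrophes, and upper-case (arbitrary choice).
    text = _ANNOTATION.sub(" ", text).replace("\u2019", "'").upper()

    # Digits would need verbalisation (done with NeMo in the dataset); reject them here.
    if any(ch.isdigit() for ch in text):
        return None

    # Reject empty sentences and sentences where symbols dominate (hypothetical threshold).
    n_chars = len(text.replace(" ", ""))
    if n_chars == 0:
        return None
    n_symbols = len(_DISALLOWED.findall(text))
    if n_symbols / n_chars > max_symbol_ratio:
        return None

    # Replace remaining symbols with spaces and collapse whitespace.
    text = " ".join(_DISALLOWED.sub(" ", text).split())
    return text or None
```

With these assumptions, `normalise_sentence("Hello, world! [applause] It's fine.")` returns `"HELLO WORLD IT'S FINE"`, while a symbol-heavy input such as `"#### $$$ ok"` is rejected and returns `None`.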

Provider: maas

Created: 2025-01-29