
LoquaciousSet

ModelScope community, updated 2026-01-06, indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/speechbrain/LoquaciousSet
Resource description:
# Loquacious Set (LargeScaleASR): 25,000 hours of transcribed and heterogeneous English speech recognition data for research and commercial use

The full details [are available in the paper](https://arxiv.org/abs/2505.21578).

The dataset is made of 6 subsets:

1. **large** contains 25,000 hours of read / spontaneous and clean / noisy transcribed speech.
2. **medium** contains 2,500 hours of read / spontaneous and clean / noisy transcribed speech.
3. **small** contains 250 hours of read / spontaneous and clean / noisy transcribed speech.
4. **clean** contains 13,000 hours of read / spontaneous transcribed speech. YODAS and People's Speech data are excluded from this subset because, despite data curation, some errors remain in their transcriptions.
5. **dev** contains 15 hours (more details in the next section).
6. **test** contains 21 hours (more details in the next section).

The large split requires 4TB of storage (including HuggingFace extraction). The shards alone are 2TB.

Example:

```python
from io import BytesIO

import torchaudio
from datasets import load_dataset

# Pick one of the "small", "medium" or "large" configurations.
# num_proc is the number of CPU cores used to prepare the data.
ds = load_dataset("speechbrain/LoquaciousSet", "small", num_proc=4)
print(ds["train"])

# Audio is stored as raw bytes; decode the first training sample.
wav_tensor, sample_rate = torchaudio.load(BytesIO(ds["train"][0]["wav"]["bytes"]))
```

## Training recipe

A full conformer ASR training recipe is available [here](https://github.com/speechbrain/speechbrain/pull/2806).

## Data description

(The following information is copied directly from the SpeechBrain data preparation README.)

TLS is a mix of 5 existing datasets with permissive licences.
The way it is mixed is described in the following table:

| Dataset | Hours taken (large/medium/small/dev/test) | License |
| ------------- | ------------- | ------------- |
| VoxPopuli | 550/500/50/5/7 | CC0 |
| LibriHeavy | 11,000/500/50/0/0 | CC BY 4.0 |
| Librispeech (dev-/test-other) | 0/0/0/5/7 | CC BY 4.0 |
| YODAS | 6,100/500/50/1.5/1.5 | CC BY 3.0 |
| People's Speech | 5,900/500/50/1.5/1.5 | CC BY 4.0 |
| CommonVoice 18.0 | 1,660/500/50/5/7 | CC0 |

*For the dev and test splits, only data from the corresponding dev and test sets of the considered dataset is used (i.e. nothing is extracted from the train sets, except for YODAS). For YODAS, we extract data from the en003 split and verify the audio and transcriptions manually to form the dev/test partitions.*

More information on each dataset is given below:

- [**VoxPopuli**](https://arxiv.org/abs/2101.00390): we follow the standard SpeechBrain data preparation.
- [**LibriHeavy**](https://arxiv.org/html/2309.08105v2): samples are randomly selected, but we follow the standard data preparation.
- [**Librispeech**](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf): Librispeech is only used for the validation and test sets of LargeScaleASR. More precisely, we extract samples from *dev-other* and *test-other* as they are the most challenging subsets.
- [**YODAS**](https://arxiv.org/abs/2406.00899): The YODAS dataset is unfortunately unreliable. The audio is crawled from YouTube, and a lot of it (almost half) is not in the labelled language. We used a [SpeechBrain language ID model](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa) to make sure that we only integrate samples where people speak English. Transcriptions have also been heavily normalised (see next section). We arbitrarily decided to use the *en000* and *en001* subsets of YODAS. Transcriptions may be a bit noisy, which is why we manually transcribed the data for the dev and test sets.
- [**People's Speech**](https://huggingface.co/datasets/MLCommons/peoples_speech): Only the *clean* subset of this dataset is used in LargeScaleASR, since even its transcriptions already contain errors.
- [**CommonVoice 18.0**](https://commonvoice.mozilla.org/en): We removed a few speakers that had too many samples (above 9000 samples) to avoid any bias. Aside from this, we used only samples coming from the *validated* csv to ensure good transcription quality. Text was also heavily normalised (see next section).

### Text and audio normalisation

Some of the above datasets, in particular People's Speech, YODAS and CommonVoice, have very little normalisation. This is an important issue, as the pronunciation is then either incorrect or uncertain. We normalised all the sentences so that the character set contains only the standard 26 letters of the Latin alphabet plus the apostrophe ('). Numerical values were converted to text using the [Nemo text processing WFST tool](https://github.com/NVIDIA/NeMo-text-processing). The rest of the text was filtered to remove symbols, YouTube annotations such as "applause", and many other elements. When sentences were too noisy (e.g. too many symbols), we simply removed them. The text normalisation can be found in *speechbrain.utils.text_normalisation*.

Audio is embedded as raw bytes (which can be decoded with soundfile). We chunked long audio files into smaller ones based on the start and stop supervision from the different manifests of the datasets (this is necessary for HuggingFace). Language ID with a [SpeechBrain language ID model](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa) was performed on YODAS.

### Count-based Language Models and Lexicon

The dataset includes three ARPA-format count-based language models trained on the text of the *train.large* subset of Loquacious:

| LM | Size | Binary Size | Dev. Perplexity |
| --- | --- | --- | --- |
| 3-gram pruned | 331MB | 721MB | 222 |
| 4-gram pruned | 538MB | 1.2GB | 202 |
| 4-gram unpruned | 2.4GB | 4.7GB | 193 |

Each language model is limited to a vocabulary containing 216k words.

We also provide a pronunciation lexicon using ARPA-style phonemes, containing one or multiple pronunciations for each of the words in the vocabulary. The original pronunciations are based on [CMUDict 0.7b](http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Missing pronunciations were generated using [Sequitur](https://github.com/sequitur-g2p/sequitur-g2p), for which we also provide the trained G2P model.

#### Referencing the Loquacious Set and SpeechBrain

```
@inproceedings{Loquacious,
  title = {Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use},
  author = {Titouan Parcollet and Yuan Tseng and Shucong Zhang and Rogier van Dalen},
  year = {2025},
  booktitle = {Interspeech 2025},
}

@article{speechbrainV1,
  author = {Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Ha Nguyen and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Ga{{\"e}}lle Laperri{{\`e}}re and Mickael Rouvier and Renato De Mori and Yannick Est{{\`e}}ve},
  title = {Open-Source Conversational AI with SpeechBrain 1.0},
  journal = {Journal of Machine Learning Research},
  year = {2024},
  volume = {25},
  number = {333},
  pages = {1--11},
  url = {http://jmlr.org/papers/v25/24-0991.html}
}
```

#### About SpeechBrain

SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.

Website: https://speechbrain.github.io/

GitHub: https://github.com/speechbrain/speechbrain
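
The character-set restriction described under "Text and audio normalisation" can be illustrated with a short sketch. This is not the code found in *speechbrain.utils.text_normalisation*: the function name `normalise_transcript`, the bracket-stripping rule, the choice of upper-casing, and the `max_symbol_ratio` threshold are all assumptions made for illustration. Number-to-text conversion, which the pipeline performs with the NeMo tool, is deliberately left out, so digits simply count as symbols here.

```python
import re
from typing import Optional

# Bracketed annotations such as "[applause]" or "(music)" (assumed format).
_ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# Anything outside the 26 letters, the apostrophe and the space.
_DISALLOWED = re.compile(r"[^A-Z' ]")


def normalise_transcript(text: str, max_symbol_ratio: float = 0.3) -> Optional[str]:
    """Return an upper-cased transcript restricted to A-Z, "'" and spaces.

    Returns None when the sentence is judged too noisy to keep.
    The 0.3 threshold is a hypothetical value, not taken from the source.
    """
    text = _ANNOTATION.sub(" ", text).upper()
    text = text.replace("\u2019", "'")  # curly apostrophe -> ASCII apostrophe
    stripped = text.replace(" ", "")
    if not stripped:
        return None
    n_bad = len(_DISALLOWED.findall(stripped))
    if n_bad / len(stripped) > max_symbol_ratio:
        return None  # too many symbols: drop the sentence
    cleaned = _DISALLOWED.sub(" ", text)
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return cleaned or None


if __name__ == "__main__":
    print(normalise_transcript("Hello, world! [applause] It's fine."))
    print(normalise_transcript("#$%^ &*() 12345"))
```

With these assumed rules, the first call keeps the sentence as `HELLO WORLD IT'S FINE`, while the second returns `None` because nearly every character is a symbol or digit.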

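
As a quick cross-check of the mixture table in the data description, the per-source hours can be summed per split. The sketch below copies the figures from that table. The names `MIXTURE_HOURS`, `split_totals` and `clean_total` are hypothetical, and treating the clean subset as "the large mixture minus YODAS and People's Speech" is an assumption based on the subset list rather than something the source spells out.

```python
# Hours taken from each source, copied from the mixture table:
# (large, medium, small, dev, test)
MIXTURE_HOURS = {
    "VoxPopuli": (550, 500, 50, 5, 7),
    "LibriHeavy": (11_000, 500, 50, 0, 0),
    "Librispeech (dev-/test-other)": (0, 0, 0, 5, 7),
    "YODAS": (6_100, 500, 50, 1.5, 1.5),
    "People's Speech": (5_900, 500, 50, 1.5, 1.5),
    "CommonVoice 18.0": (1_660, 500, 50, 5, 7),
}
SPLITS = ("large", "medium", "small", "dev", "test")
# Sources left out of the "clean" subset, as stated in the subset list.
EXCLUDED_FROM_CLEAN = {"YODAS", "People's Speech"}


def split_totals(mixture):
    """Sum the per-source hours for every split."""
    return {
        split: sum(hours[i] for hours in mixture.values())
        for i, split in enumerate(SPLITS)
    }


def clean_total(mixture):
    """Sum the 'large' hours of the sources kept in the clean subset (assumed definition)."""
    return sum(
        hours[0]
        for name, hours in mixture.items()
        if name not in EXCLUDED_FROM_CLEAN
    )


if __name__ == "__main__":
    print(split_totals(MIXTURE_HOURS))
    print(clean_total(MIXTURE_HOURS))
```

With the table's figures this gives 25,210 / 2,500 / 250 hours for large / medium / small and 13,210 hours for the clean selection, in line with the rounded headline numbers. The dev and test columns sum to 18 and 24 hours, against the 15 and 21 hours quoted in the subset list; the source does not say which figures are exact.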
Provider:
maas
Created:
2025-05-23