FBK-MT/mosel
Hugging Face | Updated: 2025-10-07 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/FBK-MT/mosel
Resource description:
The MOSEL corpus is a multilingual dataset collection comprising up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. The data was collected by surveying labeled and unlabeled speech corpora released under open-source compliant licenses. In particular, MOSEL includes automatic transcripts of 441K hours of unlabeled speech from VoxPopuli and LibriLight, produced with Whisper large v3. Whisper is released under the open-source Apache 2.0 license, which allows the generated content to be released under any license. Because LibriLight, unlike VoxPopuli, contains segments longer than Whisper's maximum duration of 30 seconds, those segments were split into chunks of at most 30 seconds.
The dataset is used for tasks such as automatic speech recognition and text-to-speech. The data is organized into one folder per language, named by ISO 639-1 code, and each folder contains the splits for the pseudo-labeled datasets. Fields include id, language, text, and several hallmark fields that indicate specific characteristics of the text. The dataset is licensed under CC-BY-4.0.
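The description notes that LibriLight recordings longer than Whisper's 30-second limit were split into chunks of at most 30 seconds. The source does not say how the cut points were chosen, so the following is only a minimal sketch that assumes plain fixed-length splitting; the function name `chunk_spans` and its parameters are hypothetical and not part of MOSEL or Whisper.

```python
def chunk_spans(duration_s: float, max_chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds that cover a recording of
    length duration_s, each span being at most max_chunk_s long.

    Assumption: fixed-length cuts. The actual MOSEL preprocessing may
    choose boundaries differently (for example, at pauses).
    """
    if duration_s < 0 or max_chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_chunk_s must be > 0")

    spans = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shortened so it never runs past the recording.
        end = min(start + max_chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # A 75-second recording yields two full chunks and one 15-second remainder.
    for start, end in chunk_spans(75.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each span could then be passed to an audio loader to extract the matching samples before transcription. A recording that is already 30 seconds or shorter comes back as a single span, which matches the description's note that VoxPopuli segments needed no splitting.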
Provider:
FBK-MT



