mosel
Download link:
https://modelscope.cn/datasets/FBK-MT/mosel
<img src="./mosel-logo-transparent.png" align="center" width="100%">
### Dataset Description, Collection, and Source
The MOSEL corpus is a multilingual dataset collection including up to 950K hours of open-source speech recordings covering the 24 official languages of the European Union. We collect data by surveying labeled and unlabeled speech corpora under open-source compliant licenses.
In particular, MOSEL includes the automatic transcripts of 441k hours of unlabeled speech from VoxPopuli and LibriLight. The data is transcribed using [Whisper large v3](https://huggingface.co/openai/whisper-large-v3).
Whisper is released under the open-source Apache 2.0 license, which allows the generated content to be released under any license. Unlike VoxPopuli, LibriLight contains segments longer than Whisper's maximum input duration of 30 seconds, so we split them into chunks of up to 30 seconds.
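As an illustration of the 30-second limit, the sketch below computes chunk boundaries for a long recording. It assumes plain fixed-length splitting; the card does not say how chunk boundaries were actually chosen, and the function name `chunk_spans` is hypothetical.

```python
def chunk_spans(duration_s: float, max_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds that cover the recording.

    Assumption: fixed-length splitting. Every span is at most
    max_len_s long; only the last one may be shorter.
    """
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


# A 75-second recording yields two full chunks and one 15-second remainder.
print(chunk_spans(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

A production pipeline might instead cut at pauses so that words are not split across chunks; the fixed-length version only shows the duration constraint.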
- **Curated by:** Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, and Matteo Negri
- **Funded by:** FAIR, Meetween, and CINECA
- **Shared by:** Fondazione Bruno Kessler
### License
- CC-BY-4.0
### Dataset Sources
- **Collection Repository:** [MOSEL](https://github.com/hlt-mt/mosel)
- **Paper:** [MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages](http://arxiv.org/abs/2410.01036)
## Dataset Structure
### Data Config
The dataset is split into folders, one per language, named with the [two-letter ISO 639 codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Within each folder, a split is provided for each pseudo-labeled dataset.
### Data Field
- `id`: alphanumeric identifier for the segment
- `language`: extended language name (e.g., "english")
- `text`: the content of the pseudo label
- `hall_repeated_ngrams`: True/False. Indicates that an *n*-gram is repeated in `text` at least a minimum number of times; the threshold is 4 for *n* from 1 to 2 and 3 for *n* from 3 to 5
- `hall_long_word`: True/False. Indicates the presence of a word of at least 40 characters in `text`
- `hall_frequent_single_word`: True/False. Indicates that `text` consists of a single word, and that this word is the most frequent one in the whole text
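The first two flags can be approximated with a few lines of Python. This is a minimal sketch, not the script shipped with the dataset: it assumes whitespace tokenization and reads "repetition" as back-to-back occurrences of the same *n*-gram, neither of which the card confirms. The function names are hypothetical. `hall_frequent_single_word` is left out because it depends on word frequencies that a single `text` value does not carry.

```python
def has_repeated_ngrams(text: str) -> bool:
    """True if some n-gram (n = 1..5) occurs back to back often enough.

    Thresholds follow the field description: 4 repetitions for n in 1-2,
    3 repetitions for n in 3-5. Consecutive repetition is an assumption.
    """
    words = text.split()
    for n in range(1, 6):
        threshold = 4 if n <= 2 else 3
        for start in range(len(words) - n * threshold + 1):
            gram = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == gram
                for k in range(1, threshold)
            ):
                return True
    return False


def has_long_word(text: str, min_chars: int = 40) -> bool:
    """True if any whitespace-separated word has at least min_chars characters."""
    return any(len(word) >= min_chars for word in text.split())


def is_clean(row: dict) -> bool:
    """True if none of the three hallucination flags is set on a dataset row."""
    return not (
        row["hall_repeated_ngrams"]
        or row["hall_long_word"]
        or row["hall_frequent_single_word"]
    )
```

With rows loaded as dictionaries, `[row for row in rows if is_clean(row)]` keeps only the segments without detected hallucinations.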
## Dataset Statistics (in hours)
The following statistics refer to the whole MOSEL dataset:
| Language (LangID) | Labeled | Unlabeled | Total |
|--------|--------|--------|-------|
| Bulgarian (bg) | 111 | 17609 | 17720 |
| Croatian (hr) | 55 | 8106 | 8161 |
| Czech (cs) | 591 | 18705 | 19296 |
| Danish (da) | 20 | 13600 | 13620 |
| Dutch (nl) | 3395 | 19014 | 22409 |
| English (en) | 437239 | 84704 | 521943 |
| Estonian (et) | 60 | 10604 | 10664 |
| Finnish (fi) | 64 | 14200 | 14264 |
| French (fr) | 26984 | 22896 | 49880 |
| German (de) | 9236 | 23228 | 32464 |
| Greek (el) | 35 | 17703 | 17738 |
| Hungarian (hu) | 189 | 17701 | 17890 |
| Irish (ga) | 17 | 0 | 17 |
| Italian (it) | 3756 | 21933 | 25689 |
| Latvian (lv) | 173 | 13100 | 13273 |
| Lithuanian (lt) | 36 | 14400 | 14436 |
| Maltese (mt) | 19 | 9100 | 9119 |
| Polish (pl) | 510 | 21207 | 21717 |
| Portuguese (pt) | 5492 | 17526 | 23018 |
| Romanian (ro) | 121 | 17906 | 18021 |
| Slovak (sk) | 61 | 12100 | 12161 |
| Slovenian (sl) | 32 | 11300 | 11332 |
| Spanish (es) | 17471 | 21526 | 38997 |
| Swedish (sv) | 58 | 16300 | 16358 |
| Total | 505725 | 444467 | 950192 |
This repository, however, contains transcripts for only a subset of the MOSEL data, with the following hours per language:
| Language (LangID) | Hours |
|--------|--------|
| Bulgarian (bg) | 13892 |
| Croatian (hr) | 5276 |
| Czech (cs) | 14960 |
| Danish (da) | 10087 |
| Dutch (nl) | 12422 |
| English (en) | 184092 |
| Estonian (et) | 7974 |
| Finnish (fi) | 10687 |
| French (fr) | 20225 |
| German (de) | 19464 |
| Greek (el) | 10982 |
| Hungarian (hu) | 11660 |
| Irish (ga) | 0 |
| Italian (it) | 16713 |
| Latvian (lv) | 9311 |
| Lithuanian (lt) | 10770 |
| Maltese (mt) | 4010 |
| Polish (pl) | 16502 |
| Portuguese (pt) | 15434 |
| Romanian (ro) | 12377 |
| Slovak (sk) | 4458 |
| Slovenian (sl) | 5851 |
| Spanish (es) | 16970 |
| Swedish (sv) | 9918 |
| Total | 444035 |
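To relate the two tables, the hours with transcripts in this repository can be divided by the total hours of the whole corpus. The sketch below does this for four languages, with values copied from the tables above; the dictionary and function names are hypothetical.

```python
# Hours copied from the two tables above (a subset of languages for brevity).
total_hours = {"en": 521943, "fr": 49880, "hr": 8161, "ga": 17}
repo_hours = {"en": 184092, "fr": 20225, "hr": 5276, "ga": 0}


def coverage(lang: str) -> float:
    """Fraction of a language's total MOSEL hours that have transcripts here."""
    return repo_hours[lang] / total_hours[lang]


for lang in total_hours:
    print(f"{lang}: {coverage(lang):.1%}")
```

Irish (`ga`) comes out at zero because the corpus contains only labeled Irish data, so no automatic transcripts are provided for it.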
## Dataset Creation
To reproduce the dataset creation, please refer to the [MOSEL README in the fbk-llm](https://github.com/hlt-mt/fbk-llm) repository.
The scripts used for hallucination detection are available in the `scripts` folder of this repository.
For version 2.0, the data has been curated with [NeMo-speech-data-processor](https://github.com/NVIDIA/NeMo-speech-data-processor).
## Changelog
### Version 2.0
- Some VoxPopuli transcripts have been updated (see column `text_version_changed`).
- Added English translations of non-English VoxPopuli transcripts without detected hallucinations.
Translations automatically judged of low quality (<0.6 according to QE scoring with EuroLLM)
- Added YouTube-Commons transcripts.
### Version 1.1
- Added missing Croatian data in VoxPopuli.
## Citation
Release 1.0:
```
@inproceedings{mosel,
title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, United States",
publisher = "Association for Computational Linguistics",
}
```
For changes in version 2.0:
```
@inproceedings{granary,
title = {{Granary: Speech Recognition and Translation Dataset in 25 European Languages}},
author = {{Nithin Rao} Koluguri and Monica Sekoyan and George Zelenfroynd and Sasha Meister and Shuoyang Ding and Sofia Kostandian and He Huang and Nikolay Karpov and Jagadeesh Balam and Vitaly Lavrukhin and Yifan Peng and Sara Papi and Marco Gaido and Alessio Brutti and Boris Ginsburg},
booktitle = "Proc. of Interspeech 2025",
month = aug,
year = "2025",
address = "Rotterdam, Netherlands",
}
```
## Dataset Card Contact
[@spapi](https://huggingface.co/spapi)
Provider: maas
Created: 2025-09-26



