mls_sidon
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/sarulab-speech/mls_sidon
# MLS-Sidon
## Overview
This dataset is a **cleansed version of Multilingual LibriSpeech (MLS)**, processed with the **Sidon** speech restoration model and intended for **Speech Synthesis** and **Spoken Language Modeling**.
The dataset is provided in **[WebDataset](https://github.com/webdataset/webdataset) format** for efficient large-scale training.
- **Source**: [Multilingual LibriSpeech](https://www.openslr.org/94/)
- **Languages**: English, German, French, Spanish, Italian, Polish, Dutch, Portuguese
- **Format**: WebDataset (`.tar` shards)
- **License**: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
---
## Dataset Structure
Each sample in the dataset contains:
- **`flac`** — audio file (48 kHz, single channel)
- **`metadata.json`** *(optional)* — metadata including language, speaker ID, and original MLS reference
Example (inside a `.tar` shard):
```
000001.flac
000001.metadata.json
000002.flac
000002.metadata.json
...
```
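The shard layout above can be explored without any WebDataset tooling. The following is a minimal sketch, assuming the usual WebDataset convention that files sharing the same name prefix (the part before the first dot) belong to one sample. The toy shard, its placeholder payloads, and the `language` metadata key are illustrative assumptions, not taken from the real dataset.

```python
import io
import json
import tarfile
from collections import defaultdict


def group_samples(tar_bytes: bytes) -> dict[str, dict[str, bytes]]:
    """Group tar members by sample key (the part before the first dot)."""
    samples: dict[str, dict[str, bytes]] = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "000001.metadata.json" -> key "000001", field "metadata.json"
            key, _, field = member.name.partition(".")
            samples[key][field] = tar.extractfile(member).read()
    return dict(samples)


def make_toy_shard() -> bytes:
    """Build a tiny in-memory shard; payloads are placeholders, not real audio."""
    files = {
        "000001.flac": b"fake-flac-bytes",
        # Hypothetical metadata content; the real key names are not documented here.
        "000001.metadata.json": json.dumps({"language": "english"}).encode(),
        "000002.flac": b"fake-flac-bytes",  # no metadata: the field is optional
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


samples = group_samples(make_toy_shard())
for key, fields in sorted(samples.items()):
    raw_meta = fields.get("metadata.json")
    metadata = json.loads(raw_meta) if raw_meta is not None else None
    print(key, sorted(fields), metadata)
```

Because `metadata.json` is optional, the loop uses `fields.get(...)` rather than indexing, so samples without metadata do not raise.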
---
## How to Use
### With 🤗 Datasets
You can load the WebDataset directly with Hugging Face’s `datasets` library:
```python
import datasets
import yaml
from huggingface_hub import hf_hub_download
from IPython.display import Audio

base_url = "https://huggingface.co/datasets/sarulab-speech/mls_sidon/resolve/main/"
language = "english"
split = "test"

# paths.yaml lists the shard paths for each language and split.
data_file_path = hf_hub_download(
    repo_id="sarulab-speech/mls_sidon", repo_type="dataset", filename="paths.yaml"
)
with open(data_file_path, "r") as f:
    paths = yaml.safe_load(f)

# Stream the shards for the selected language and split.
ds = datasets.load_dataset(
    "webdataset",
    data_files=[base_url + p for p in paths[language][split]],
    streaming=True,
)["train"]

sample = next(iter(ds))
audio = sample["flac"]
print(sample["metadata.json"])
Audio(audio["array"], rate=audio["sampling_rate"])
```
Set `language` to the language you want to load (e.g., `english`, `german`).
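Selecting a language and split comes down to a lookup in the `paths.yaml` mapping followed by prefixing each shard path with the base URL. The sketch below isolates that step with basic validation. The helper `shard_urls` and the shard file names in `toy_paths` are hypothetical, and the nested language-then-split layout is inferred from the loading snippet above rather than from a documented schema.

```python
BASE_URL = "https://huggingface.co/datasets/sarulab-speech/mls_sidon/resolve/main/"


def shard_urls(paths: dict, language: str, split: str, base_url: str = BASE_URL) -> list[str]:
    """Return full shard URLs for one language and split, with clear errors."""
    if language not in paths:
        raise KeyError(f"unknown language {language!r}; available: {sorted(paths)}")
    splits = paths[language]
    if split not in splits:
        raise KeyError(f"unknown split {split!r} for {language!r}; available: {sorted(splits)}")
    return [base_url + p for p in splits[split]]


# Hypothetical stand-in for the parsed paths.yaml; real shard names will differ.
toy_paths = {
    "english": {"test": ["english/test/shard-000.tar"]},
    "german": {"test": ["german/test/shard-000.tar", "german/test/shard-001.tar"]},
}

urls = shard_urls(toy_paths, "german", "test")
print(urls)
```

The resulting list can be passed as `data_files` to `datasets.load_dataset("webdataset", ...)`. Failing early on an unknown language or split gives a more readable error than a bare `KeyError` raised from inside a list comprehension.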
---
## Citation
If you use this dataset, please cite Sidon and the original MLS paper:
```
@misc{nakata2025sidonfastrobustopensource,
title={Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing},
author={Wataru Nakata and Yuki Saito and Yota Ueda and Hiroshi Saruwatari},
year={2025},
eprint={2509.17052},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2509.17052},
}
```
```
@inproceedings{pratap2020mls,
title = {MLS: A Large-Scale Multilingual Dataset for Speech Research},
author = {Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and others},
booktitle = {Interspeech},
year = {2020}
}
```
---
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
---
## Acknowledgements
* **Original data**: [Multilingual LibriSpeech (MLS)](https://www.openslr.org/94/)
Provider: maas
Created: 2025-10-13



