
mls_sidon

ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/sarulab-speech/mls_sidon
Resource description:
# MLS-Sidon

## Overview

This dataset is a **cleansed version of Multilingual LibriSpeech (MLS)**, restored with the **Sidon** speech restoration model and intended for **Speech Synthesis** and **Spoken Language Modeling**. The dataset is provided in **[WebDataset](https://github.com/webdataset/webdataset) format** for efficient large-scale training.

- **Source**: [Multilingual LibriSpeech](https://www.openslr.org/94/)
- **Languages**: English, German, French, Spanish, Italian, Polish, Dutch, Portuguese
- **Format**: WebDataset (`.tar` shards)
- **License**: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

---

## Dataset Structure

Each sample in the dataset contains:

- **`flac`**: audio file (48 kHz, single channel)
- **`metadata.json`** *(optional)*: metadata including language, speaker ID, and original MLS reference

Example (inside a `.tar` shard):

```
000001.flac
000001.metadata.json
000002.flac
000002.metadata.json
...
```

---

## How to Use

### With 🤗 Datasets

You can load the WebDataset directly with Hugging Face's `datasets` library:

```python
import datasets
import yaml
from huggingface_hub import hf_hub_download
from IPython.display import Audio

base_url = "https://huggingface.co/datasets/sarulab-speech/mls_sidon/resolve/main/"
language = "english"
split = "test"

# paths.yaml lists the shard paths, indexed by language and then by split.
data_file_path = hf_hub_download(
    repo_id="sarulab-speech/mls_sidon", repo_type="dataset", filename="paths.yaml"
)
with open(data_file_path, "r") as f:
    paths = yaml.load(f, Loader=yaml.FullLoader)

# Stream the shards rather than downloading the whole split up front.
ds = datasets.load_dataset(
    "webdataset",
    data_files=[base_url + p for p in paths[language][split]],
    streaming=True,
)["train"]

sample = next(iter(ds))
audio = sample["flac"]
print(sample["metadata.json"])
Audio(audio["array"], rate=audio["sampling_rate"])
```

Set `language` to the language you want to load (e.g., `english`, `german`).

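
To make the shard layout under "Dataset Structure" more concrete, the following is a minimal sketch of how files in a `.tar` shard can be grouped into samples. It assumes the usual WebDataset convention that the sample key is the part of the file name before the first period and the field name is everything after it. The helper names (`split_key`, `group_samples`, `make_toy_shard`) and the `language` metadata key are hypothetical, and the shard is built in memory with placeholder bytes rather than real audio, so nothing here reflects the actual contents of the MLS-Sidon shards.

```python
import io
import json
import tarfile
from collections import defaultdict


def split_key(name):
    """Split 'dir/000001.metadata.json' into ('dir/000001', 'metadata.json')."""
    directory, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def group_samples(tar_bytes):
    """Group the regular files of a tar archive into samples by shared key."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def make_toy_shard():
    """Build a tiny in-memory shard; payloads are placeholders, not audio."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key in ("000001", "000002"):
            # Hypothetical metadata content; the real field names may differ.
            meta = json.dumps({"language": "english"}).encode("utf-8")
            files = ((f"{key}.flac", b"not-real-audio"),
                     (f"{key}.metadata.json", meta))
            for name, payload in files:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


samples = group_samples(make_toy_shard())
for key, fields in samples.items():
    print(key, sorted(fields))
```

Each sample ends up as a dictionary with a `flac` entry and, when present, a `metadata.json` entry, which matches the field names used when indexing `sample` in the loading example.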

---

## Citation

If you use this dataset, please cite Sidon and the original MLS paper:

```bibtex
@misc{nakata2025sidonfastrobustopensource,
  title={Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing},
  author={Wataru Nakata and Yuki Saito and Yota Ueda and Hiroshi Saruwatari},
  year={2025},
  eprint={2509.17052},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.17052},
}
```

```bibtex
@inproceedings{pratap2020mls,
  title = {MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author = {Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and others},
  booktitle = {Interspeech},
  year = {2020}
}
```

---

## License

This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

---

## Acknowledgements

* **Original data**: [Multilingual LibriSpeech (MLS)](https://www.openslr.org/94/)

Provider: maas
Created: 2025-10-13