MediaSpeech
Source: ModelScope community. Updated 2025-11-28; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/Flame4pd/MediaSpeech
Resource description:
# MediaSpeech
MediaSpeech is a dataset of Arabic, French, Spanish, and Turkish media speech built to test the performance of Automated Speech Recognition (ASR) systems. The dataset contains 10 hours of speech for each language.
The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre-processing and post-processing.
Baseline models and a WAV version of the dataset can be found in this [git repository](https://github.com/NTRLab/MediaSpeech).
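Since the dataset is intended for testing ASR performance, a common way to score a system against the manual transcriptions is word error rate (WER). The following is a minimal sketch, not the benchmark's official scoring code: the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization, which may differ from what the baseline models use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing, punctuation stripping, or other normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings, not taken from the dataset.
print(word_error_rate("a b c d", "a x c"))
```

Here `reference` would be the `sentence` field of a record and `hypothesis` the output of the ASR system under test. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.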
## How to load the dataset
The dataset has 4 languages: Arabic (`ar`), Spanish (`es`), French (`fr`), and Turkish (`tr`). To load a language portion of the dataset:
```
from datasets import load_dataset

# Load the Arabic ("ar") portion; use "es", "fr", or "tr" for the other languages.
downloaded_dataset = load_dataset("ymoslem/MediaSpeech", "ar", split="train")
```
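To work with all four languages at once, the same call can be repeated per language code. The sketch below is one possible way to do that; the helper `load_language_portions` and its `load_fn` parameter are hypothetical names introduced here so the loop can be exercised without downloading anything. In real use, `load_fn` would be `datasets.load_dataset`.

```python
# Language codes listed in this dataset card.
LANGUAGES = {"ar": "Arabic", "es": "Spanish", "fr": "French", "tr": "Turkish"}


def load_language_portions(load_fn, repo_id="ymoslem/MediaSpeech", split="train"):
    """Return a dict mapping each language code to its loaded portion.

    `load_fn` is expected to accept the same arguments as
    `datasets.load_dataset(repo_id, config_name, split=...)`.
    """
    return {code: load_fn(repo_id, code, split=split) for code in LANGUAGES}
```

With the `datasets` library installed, `load_language_portions(load_dataset)` would download all four portions, so expect it to take noticeably longer than loading a single language.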
## Dataset structure
The dataset structure is as follows:
```
DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2505
    })
})
```
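Given the `audio` and `sentence` features, one sanity check is to sum clip durations and compare the total with the stated 10 hours per language. The sketch below assumes each decoded `audio` value behaves like a dict with `array` (the mono samples) and `sampling_rate` keys, which is how the `datasets` Audio feature has commonly decoded clips; newer library versions may return a different object. The helper names and the demo records are hypothetical stand-ins, not real dataset rows.

```python
def clip_seconds(audio):
    """Duration of one decoded clip: number of samples / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def total_hours(records):
    """Sum the durations of all records and convert seconds to hours."""
    return sum(clip_seconds(record["audio"]) for record in records) / 3600.0


# Synthetic records that mimic the assumed layout (silence, placeholder text).
demo = [
    {"audio": {"array": [0.0] * 16000, "sampling_rate": 16000}, "sentence": "placeholder one"},
    {"audio": {"array": [0.0] * 8000, "sampling_rate": 16000}, "sentence": "placeholder two"},
]
print(f"{total_hours(demo) * 3600:.2f} seconds in the demo records")
```

On a real portion, `total_hours(downloaded_dataset)` would decode every clip, so it can be slow on the first run.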
## Citation
To cite the dataset, use the following BibTeX entry:
```
@misc{mediaspeech2021,
title={MediaSpeech: Multilanguage ASR Benchmark and Dataset},
author={Rostislav Kolobov and Olga Okhapkina and Olga Omelchishina and Andrey Platunov and Roman Bedyakin and Vyacheslav Moshkin and Dmitry Menshikov and Nikolay Mikhaylovskiy},
year={2021},
eprint={2103.16193},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
```
Provider:
maas
Created:
2025-11-27



