MediaSpeech

Name: MediaSpeech
Creator: maas
Published: 2025-11-28 17:57:19
License: No description available

ModelScope community · Updated 2025-11-28 · Indexed 2025-11-29

Download link:

https://modelscope.cn/datasets/Flame4pd/MediaSpeech


Resource description:

# MediaSpeech

MediaSpeech is a dataset of Arabic, French, Spanish, and Turkish media speech built to test the performance of Automated Speech Recognition (ASR) systems. It contains 10 hours of speech for each language. The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre-processing and post-processing. Baseline models and a WAV version of the dataset can be found in this [git repository](https://github.com/NTRLab/MediaSpeech).

## How to load the dataset

The dataset has 4 languages: Arabic (`ar`), Spanish (`es`), French (`fr`), and Turkish (`tr`). To load a language portion of the dataset:

```
from datasets import load_dataset

downloaded_dataset = load_dataset("ymoslem/MediaSpeech", "ar", split="train")
```

## Dataset structure

The dataset structure is as follows:

```
DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2505
    })
})
```

## Citation

To cite the dataset, use the following BibTeX entry:

```
@misc{mediaspeech2021,
  title={MediaSpeech: Multilanguage ASR Benchmark and Dataset},
  author={Rostislav Kolobov and Olga Okhapkina and Olga Omelchishina and Andrey Platunov and Roman Bedyakin and Vyacheslav Moshkin and Dmitry Menshikov and Nikolay Mikhaylovskiy},
  year={2021},
  eprint={2103.16193},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
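
## Example: scoring transcriptions (illustrative)

Because the dataset is meant for testing ASR systems, a common way to use it is to compare a system's output against the manual transcription in the `sentence` field using word error rate (WER). The following is a minimal sketch, not the authors' evaluation code: the helper name `word_error_rate` is hypothetical, it splits on whitespace only, and it applies no text normalization (casing, punctuation, diacritics), which a real evaluation would likely need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not part of the MediaSpeech repository.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Toy strings stand in for a dataset row's `sentence` and an ASR output.
print(word_error_rate("a b c d", "a x c"))
```

With a loaded split, each row's `sentence` value would serve as the reference and the ASR system's transcript of the corresponding `audio` as the hypothesis.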


Provider:

maas

Created:

2025-11-27
