MagicData

OpenDataLab · Updated 2026-03-29 · Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/MagicData

Resource description:

The MAGICDATA Mandarin Reading Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and is released free of charge for non-commercial use. The corpus contains 755 hours of speech data, most of it recorded on mobile devices. A total of 1,080 speakers from different accent regions of China were invited to take part in the recording, which was carried out in quiet indoor environments. Sentence transcription accuracy is above 98%. The database is divided into training, validation, and test sets in a ratio of 51:1:2. Details such as the speech data encoding and speaker information are kept in metadata files. The recorded text covers diverse domains, including interactive question answering, music search, SNS messages, and home command and control. Segmented transcripts are also provided. The corpus is intended to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and it is completely free for academic use.
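As a rough illustration of the 51:1:2 split stated above, the following sketch estimates how many hours each subset would hold if the ratio were applied to the 755 hours of audio. This is an assumption: the description does not say whether the ratio refers to hours, utterances, or speakers, so the numbers are only an estimate. The function name `estimate_split_hours` is hypothetical and is not part of any corpus tooling.

```python
# Figures taken from the corpus description.
TOTAL_HOURS = 755
SPLIT_RATIO = {"train": 51, "validation": 1, "test": 2}


def estimate_split_hours(total_hours, ratio):
    """Divide total_hours proportionally to the integer parts in ratio.

    Assumes the ratio applies to audio duration, which the source
    description does not confirm.
    """
    parts = sum(ratio.values())  # 51 + 1 + 2 = 54 parts in total
    return {name: total_hours * share / parts for name, share in ratio.items()}


split_hours = estimate_split_hours(TOTAL_HOURS, SPLIT_RATIO)
for name, hours in split_hours.items():
    print(f"{name}: about {hours:.1f} h")
```

Under that assumption the training set accounts for 51/54 of the audio, so the validation and test sets together make up only a small fraction of the corpus.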

Provider:

OpenDataLab

Created:

2023-06-25

AI-compiled summary

Dataset introduction