---
annotations_creators:
- expert-generated
language:
- zh
language_creators:
- crowdsourced
license:
- cc-by-nc-nd-4.0
multilinguality:
- monolingual
pretty_name: MAGICDATA_Mandarin_Chinese_Read_Speech_Corpus
size_categories:
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- automatic-speech-recognition
task_ids: []
---
# Dataset Card for MMCRSC
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [MAGICDATA Mandarin Chinese Read Speech Corpus](https://openslr.org/68/)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and is freely published for non-commercial use.
The contents and the corresponding descriptions of the corpus include:
- The corpus contains 755 hours of speech data, most of it recorded on mobile devices.
- 1080 speakers from different accent areas in China were invited to participate in the recording.
- The sentence transcription accuracy is higher than 98%.
- Recordings were conducted in a quiet indoor environment.
- The database is divided into a training set, a validation set, and a testing set in a ratio of 51:1:2.
- Detailed information such as speech data coding and speaker information is preserved in the metadata file.
- The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, home command and control, etc.
- Segmented transcripts are also provided.

The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields. It is therefore free for academic use.
The corpus is a subset of a much larger dataset (the 10566.9-hour Chinese Mandarin Speech Corpus) that was recorded in the same environment. Please feel free to contact us via business@magicdatatech.com for more details.
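
To get a feel for what the 51:1:2 split ratio means in practice, the following minimal sketch divides the 755 hours proportionally. It assumes the ratio applies to audio duration, which the summary does not state (it could equally refer to utterance or speaker counts); the helper name `split_by_ratio` is hypothetical.

```python
def split_by_ratio(total, ratio):
    """Divide `total` proportionally to the integer parts in `ratio`."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


TOTAL_HOURS = 755   # corpus size stated in the summary
RATIO = (51, 1, 2)  # train : validation : test, as stated in the summary

# Assumption: the ratio is applied to hours of audio.
hours = split_by_ratio(TOTAL_HOURS, RATIO)
for name, h in zip(("train", "validation", "test"), hours):
    print(f"{name}: ~{h:.1f} h")
```

Under that assumption the splits come to roughly 713, 14, and 28 hours. These are estimates derived from the ratio, not published per-split figures.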
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
zh-CN
## Dataset Structure
### Data Instances
```python
{
    'file': '14_3466_20170826171404.wav',
    'audio': {
        'path': '14_3466_20170826171404.wav',
        'array': array([0., 0., 0., ..., 0., 0., 0.]),
        'sampling_rate': 16000
    },
    'text': '请搜索我附近的超市',
    'speaker_id': 143466,
    'id': '14_3466_20170826171404.wav'
}
```
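
In the single instance above, the `speaker_id` (143466) matches the first two underscore-separated parts of the file name joined together (`14` and `3466`), and the third part looks like a `YYYYMMDDHHMMSS` timestamp. The sketch below parses a file name on that basis. Both readings are inferred from this one example rather than from a documented naming convention, and `parse_file_name` is a hypothetical helper.

```python
from datetime import datetime


def parse_file_name(file_name):
    """Split a name like '14_3466_20170826171404.wav' into its apparent parts.

    Assumption (not documented by the corpus): the layout is
    <prefix>_<number>_<timestamp>.wav, where prefix + number form the
    speaker id and the timestamp is YYYYMMDDHHMMSS.
    """
    stem = file_name.rsplit(".", 1)[0]
    prefix, number, stamp = stem.split("_")
    return {
        "speaker_id": int(prefix + number),
        "recorded_at": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
    }


info = parse_file_name("14_3466_20170826171404.wav")
print(info["speaker_id"], info["recorded_at"].isoformat())
```

When the `speaker_id` field is available, prefer it over anything parsed from the file name; the metadata file mentioned in the summary is the authoritative source for speaker information.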
### Data Fields
- file: Path to the downloaded audio file in .wav format.
- audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. When the audio column is accessed, as in `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so query the sample index before the `"audio"` column: `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- text: The transcription of the audio file.
- id: Unique ID of the data sample.
- speaker_id: Unique ID of the speaker. The same speaker ID can appear in multiple data samples.
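
The access-order advice for the audio field can be illustrated with a toy model. The class below is a hypothetical stand-in, not the `datasets` implementation: it decodes lazily and counts how many files each access pattern touches. The file names and the dummy audio array are made up for illustration.

```python
class LazyAudioDataset:
    """Toy stand-in (hypothetical) that decodes audio on access and counts decodes."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Stands in for the expensive decode + resample step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0, 0.0, 0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every sample is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


def count_decodes(access, n_files=1000):
    """Run one access pattern on a fresh toy dataset and report decode count."""
    ds = LazyAudioDataset(f"clip_{i}.wav" for i in range(n_files))
    access(ds)
    return ds.decode_calls


row_first = count_decodes(lambda ds: ds[0]["audio"])
column_first = count_decodes(lambda ds: ds["audio"][0])
print(row_first, column_first)  # 1 1000
```

In this model, indexing the row first decodes one file, while indexing the column first decodes all of them only to discard all but one.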
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The corpus is released under the CC BY-NC-ND 4.0 license.
### Citation Information
Please cite the corpus as "Magic Data Technology Co., Ltd., "http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101", 05/2019".
### Contributions
[More Information Needed]