ChildMandarin
Source: ModelScope (魔搭社区). Updated 2026-05-07; indexed 2025-03-22.
Download link: https://modelscope.cn/datasets/BAAI/ChildMandarin
Resource description:
# ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
[Hugging Face](https://huggingface.co/datasets/BAAI/ChildMandarin)
[arXiv](https://arxiv.org/abs/2409.18584)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
[GitHub](https://github.com/flageval-baai/ChildMandarin)
## Introduction
**ChildMandarin** is a comprehensive, open-source Mandarin Chinese speech dataset designed for research on young children aged 3 to 5. It addresses the lack of publicly available speech resources for this age group and supports work on automatic speech recognition (ASR), speaker verification (SV), and related fields. The dataset is released under a **CC BY-NC-SA 4.0 license**, so it may be used for non-commercial purposes only, subject to attribution and share-alike terms.
## Dataset Details
This dataset contains 41.25 hours of high-quality speech data collected from 397 children across 22 provinces in China. Key features of the dataset include:
* **Age Range:** 3-5 years old (inclusive). This is a crucial age range often overlooked in speech datasets.
* **Speakers:** 397 unique child speakers.
* **Geographic Diversity:** Speakers from 22 of China's 34 provincial-level administrative divisions, capturing a range of regional accents.
* **Gender Balance:** Approximately equal representation of male and female speakers across all age groups.
* **Recording Conditions:** Recordings were made in quiet environments using a variety of smartphones (both Android and iPhone devices) to ensure real-world applicability.
* **Content:** Natural, conversational speech during age-appropriate activities. The content is unrestricted, promoting spontaneous and natural interactions.
* **Audio Format:** WAV files with a 16kHz sampling rate.
* **Transcriptions:** Carefully crafted, character-level manual transcriptions.
* **Annotations:** The dataset includes annotations at both the utterance level and the speaker level:
  * **Utterance-level**: `id`, `audio` (file path), `text` (transcription).
  * **Speaker-level**: `speaker_id`, `age`, `gender`, `accent`, `location` (province), `device`.
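As a quick sanity check on the audio format listed above, the following minimal sketch reads a WAV header with Python's standard `wave` module and compares its sampling rate with the stated 16 kHz. The helper names `check_wav` and `make_silent_wav` are illustrative and not part of the dataset's tooling. The mono, 16-bit layout of the synthetic file is an assumption made only for the demo, since the channel count and bit depth of the real recordings are not stated here.

```python
import io
import wave

EXPECTED_RATE = 16_000  # sampling rate stated in the dataset card


def check_wav(source):
    """Return (sample_rate, duration_seconds, matches_expected) for a WAV file.

    `source` may be a path string or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration, rate == EXPECTED_RATE


def make_silent_wav(seconds, rate=EXPECTED_RATE):
    """Build an in-memory mono 16-bit WAV of silence (synthetic demo data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


rate, duration, ok = check_wav(make_silent_wav(2))
print(rate, duration, ok)
```

Running the same check over real files would flag any recording whose header does not report 16 kHz before it reaches a training pipeline.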
### Dataset Structure
The dataset is split into three subsets:
| Split | # Speakers | # Utterances | Duration (hrs) | Avg. Utterance Length (s) |
| :--------- | :--------: | :----------: | :------------: | :-----------------------: |
| `train` | 317 | 32,658 | 33.35 | 3.68 |
| `validation` | 39 | 4,057 | 3.78 | 3.35 |
| `test` | 41 | 4,198 | 4.12 | 3.53 |
| **Total** | **397** | **40,913** | **41.25** | **3.52** |
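The average utterance lengths in the table can be reproduced from the duration and utterance counts. The short sketch below does that arithmetic using only the figures from the table. It also suggests that the 3.52 s in the Total row is the unweighted mean of the three per-split averages, whereas pooling all utterances (41.25 h over 40,913 utterances) gives about 3.63 s. That reading of the Total row is an inference from the numbers, not something the source states.

```python
# (hours, utterances) per split, copied from the table above
SPLITS = {
    "train": (33.35, 32_658),
    "validation": (3.78, 4_057),
    "test": (4.12, 4_198),
}


def avg_utterance_seconds(hours, utterances):
    """Average utterance length in seconds: total seconds / utterance count."""
    return hours * 3600 / utterances


per_split = {name: avg_utterance_seconds(h, n) for name, (h, n) in SPLITS.items()}

total_hours = sum(h for h, _ in SPLITS.values())
total_utts = sum(n for _, n in SPLITS.values())

# Average over all utterances pooled together
pooled = avg_utterance_seconds(total_hours, total_utts)
# Plain mean of the three per-split averages (ignores split sizes)
unweighted = sum(per_split.values()) / len(per_split)

for name, value in per_split.items():
    print(f"{name}: {value:.2f} s")
print(f"pooled: {pooled:.2f} s, mean of split averages: {unweighted:.2f} s")
```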
The dataset files are organized as follows (the `dev/` directory appears to correspond to the `validation` split in the table above):
```
data/
├── train/*.tar
├── dev/*.tar
└── test/*.tar
speaker_info.xlsx # summary of speaker information
```
Each WAV file has a corresponding JSON file with the same name, containing its annotations.
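One way to consume this layout is to walk a tar shard and pair each WAV with its same-named JSON file. The sketch below uses only Python's standard `tarfile` and `json` modules. It assumes that each shard stores the `.wav` and `.json` members under matching names and that the JSON holds the utterance-level fields listed earlier (`id`, `text`). The function `iter_pairs`, the demo member names, and the demo contents are hypothetical and exist only to make the example self-contained.

```python
import io
import json
import tarfile
from pathlib import PurePosixPath


def iter_pairs(tar):
    """Yield (stem, wav_bytes, annotation_dict) for each WAV/JSON pair in an open tar.

    Members are matched by path without extension; a pair is yielded once both
    halves have been seen, so member order inside the tar does not matter.
    """
    pending = {}  # stem -> {".wav": bytes, ".json": dict} (partially filled)
    for member in tar:
        if not member.isfile():
            continue
        path = PurePosixPath(member.name)
        suffix = path.suffix.lower()
        if suffix not in (".wav", ".json"):
            continue
        data = tar.extractfile(member).read()
        stem = str(path.with_suffix(""))
        entry = pending.setdefault(stem, {})
        entry[suffix] = json.loads(data.decode("utf-8")) if suffix == ".json" else data
        if len(entry) == 2:
            del pending[stem]
            yield stem, entry[".wav"], entry[".json"]


def build_demo_shard():
    """Create a tiny in-memory tar with one fake WAV/JSON pair (synthetic data)."""
    files = {
        "utt_0001.wav": b"RIFF-placeholder",  # not real audio
        "utt_0001.json": json.dumps(
            {"id": "utt_0001", "text": "你好"}, ensure_ascii=False
        ).encode("utf-8"),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_shard(), mode="r") as shard:
    pairs = list(iter_pairs(shard))

for stem, wav_bytes, annotation in pairs:
    print(stem, len(wav_bytes), annotation["text"])
```

For a real shard, `tarfile.open(path_to_shard, "r")` would replace the in-memory demo. The pairing logic stays the same as long as the naming assumption holds.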
For more details, please refer to our paper [ChildMandarin](https://arxiv.org/abs/2409.18584).
## 📚 Cite me
```
@article{zhou2024childmandarin,
title={ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5},
author={Zhou, Jiaming and Wang, Shiyao and Zhao, Shiwan and He, Jiabei and Sun, Haoqin and Wang, Hui and Liu, Cheng and Kong, Aobo and Guo, Yujie and Qin, Yong},
journal={arXiv preprint arXiv:2409.18584},
year={2024}
}
```
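Provider: maas
Created: 2025-03-18
Summary: ChildMandarin is a Mandarin Chinese speech dataset for children aged 3-5. It contains 41.25 hours of speech from 397 children across 22 provinces in China. Its defined age range, geographic diversity, and gender balance make it suitable for research on automatic speech recognition and speaker verification. (This summary was compiled and generated by 遇见数据集.)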



