The CMU Kids Corpus

Name: The CMU Kids Corpus
Creator: Linguistic Data Consortium
Published: 2025-03-06 08:55:09
License: No description available

DataCite Commons | Updated 2025-03-06 | Indexed 2024-07-13

Download link:

https://catalog.ldc.upenn.edu/LDC97S63


Resource description:

<h3>Introduction</h3><br>
<p>This database consists of sentences read aloud by children. It was originally designed as a training set of children's speech for the SPHINX II automatic speech recognizer, for use in the LISTEN project at Carnegie Mellon University.</p><br>
<h3>Data</h3><br>
<p>The children range in age from six to eleven (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There are 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5,180 utterances in all.</p><br>
<p>The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. These speakers were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh, and were recorded on-site in the summer of 1995. This set will hereafter be called SUM95. There are 44 speakers and 3,333 utterances in this set.</p><br>
<p>The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers and who could therefore benefit from any reading tutor or other system built upon this database. They attended Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP. There are 32 speakers and 1,847 utterances in this set. The list of speakers, the set each belongs to, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl".</p><br>
<p>It should be noted that although there is some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives a very good representation of the dialects of the children that may be targeted by the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese".</p><br>
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul><br> <li><a href="desc/addenda/LDC97S63.sph">Audio Sample</a></li><br> <li><a href="desc/addenda/LDC97S63.trn">Transcript Sample</a></li><br> </ul><br>
<h3>Updates</h3><br>
<p>There are no updates at this time.</p><br>
<p>The text presented to the children was obtained from Weekly Reader stories. Weekly Reader is a four-page color reading supplement given out to children in many classrooms. Special reprint permission granted by Weekly Reader (R), published by Weekly Reader Corporation. Copyright (c) 1994, 1995 by Weekly Reader Corporation. All Rights Reserved.</p>
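
The speaker and utterance counts quoted in the description can be cross-checked against each other. The following is a minimal sketch that uses only the figures stated above; the names `SUBSETS`, `SPEAKERS_BY_GENDER`, `check_totals`, and `mean_utterances_per_speaker` are made up for this illustration and are not part of the corpus distribution.

```python
# Figures copied from the corpus description; nothing here reads the corpus itself.
SUBSETS = {
    "SUM95": {"speakers": 44, "utterances": 3333},
    "FP": {"speakers": 32, "utterances": 1847},
}
SPEAKERS_BY_GENDER = {"male": 24, "female": 52}
TOTAL_UTTERANCES = 5180


def check_totals():
    """Sum the per-subset figures and confirm they match the stated totals."""
    speakers = sum(s["speakers"] for s in SUBSETS.values())
    utterances = sum(s["utterances"] for s in SUBSETS.values())
    # The subset speaker counts should add up to the male + female count.
    assert speakers == sum(SPEAKERS_BY_GENDER.values())
    # The subset utterance counts should add up to the stated corpus total.
    assert utterances == TOTAL_UTTERANCES
    return speakers, utterances


def mean_utterances_per_speaker():
    """Average number of utterances per speaker in each subset."""
    return {name: s["utterances"] / s["speakers"] for name, s in SUBSETS.items()}


if __name__ == "__main__":
    print(check_totals())
    print(mean_utterances_per_speaker())
```

Both subsets together account for all 76 speakers and all 5,180 utterances. The per-speaker averages are only rough indicators; the actual number of sentences per speaker is listed in "speaker.tbl".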


Provider:

Linguistic Data Consortium

Created:

2020-11-30

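Collected summary

Dataset introduction

Background and challenges

Background overview

The CMU Kids Corpus is a database of children's speech containing 5,180 utterances read aloud by children aged six to eleven. It was built primarily to train the SPHINX II automatic speech recognizer for use in the LISTEN project. The data is divided into two subsets: SUM95 contains speech from 44 "good" readers, while FP contains speech from 32 children whose recordings supply examples of errorful reading and dialectal variants (such as "Pittsburghese"). The corpus is intended to support the development of speech recognition systems for children and to cover dialectal diversity.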

The content above was collected and summarized by 遇见数据集.
