MyST Children's Conversational Speech

Name: MyST Children's Conversational Speech
Creator: Linguistic Data Consortium
Published: 2021-06-15 16:59:45
License: No description available

Source: DataCite Commons. Updated 2021-06-15. Indexed 2025-04-16.

Download link:

https://catalog.ldc.upenn.edu/LDC2021S05

Description:

<h3>Introduction</h3><br> <p>MyST (My Science Tutor) Children's Conversational Speech was developed by <a href="https://boulderlearning.com/">Boulder Learning Inc.</a> It comprises approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary.</p><br> <p>Data was collected in two phases between 2008 and 2017. In both phases, spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System (FOSS), a research-based science curriculum for grades K-8. The eight FOSS science modules represented in this data set consisted of an average of 16 small-group classroom science investigations. Following the investigations, students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.</p><br> <h3>Data</h3><br> <p>Speech data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. All data collected in Phase I was transcribed using rich transcription guidelines; data collected in Phase II was partially transcribed using a reduced version of those guidelines. The transcription guidelines are included in this release.</p><br> <p>Data is divided into development, test, and train partitions for use with ASR systems.</p><br> <p>Speech is presented in single-channel, 16 kHz, 16-bit FLAC-compressed WAV format. Transcripts are UTF-8 encoded plain text.</p><br> <h3>Samples</h3><br> <p>Please view this <a href="desc/addenda/LDC2021S05.flac">audio sample (FLAC)</a> and <a href="desc/addenda/LDC2021S05.txt">transcript sample (TXT)</a>.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br>
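
The audio format stated above (single channel, 16 kHz, 16-bit FLAC) can be checked against a downloaded file without any third-party library. The following is a minimal sketch, not part of the LDC release: it reads the STREAMINFO metadata block that the FLAC format places at the start of every stream. The function names `read_flac_streaminfo` and `matches_stated_format` are hypothetical helpers introduced here for illustration.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse the STREAMINFO block from the first bytes of a FLAC stream.

    Expects at least the first 42 bytes of the file: the 4-byte "fLaC"
    marker, a 4-byte metadata block header, and the 34-byte STREAMINFO body.
    """
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing fLaC marker)")
    # Block header: 1 bit last-block flag, 7 bits block type, 24 bits length.
    block_type = data[4] & 0x7F
    length = int.from_bytes(data[5:8], "big")
    if block_type != 0 or length < 34:
        raise ValueError("first metadata block is not STREAMINFO")
    body = data[8:42]
    if len(body) < 34:
        raise ValueError("truncated STREAMINFO block")
    # Bytes 10..17 of the body pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_stated_format(info: dict) -> bool:
    """True if the stream is mono, 16 kHz, 16-bit, as the description states."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )
```

To use it on a file, read the first 42 bytes in binary mode (for example `open(path, "rb").read(42)`, where `path` is whatever local file you have) and pass them to `read_flac_streaminfo`. Dividing `total_samples` by `sample_rate` gives the duration in seconds.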

Provider:

Linguistic Data Consortium

Created:

2021-06-07

Collection summary

Dataset introduction