NIST Meeting Pilot Corpus Transcripts and Metadata

Name: NIST Meeting Pilot Corpus Transcripts and Metadata
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:16:53
License: No description available

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.

Download link:

https://catalog.ldc.upenn.edu/LDC2004T13


Resource description:

<h3>Introduction</h3> <p>NIST Meeting Pilot Corpus Transcripts and Metadata was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T13, ISBN 1-58563-303-8.</p><p>This corpus contains the full speech transcripts created by the Linguistic Data Consortium for the NIST Automatic Meeting Recognition Project, as well as a metadata database with information about the meeting forums, topics, participants, recording conditions, and equipment. The corresponding speech files are available as the <a href="http://catalog.ldc.upenn.edu/LDC2004S09" rel="nofollow">NIST Meeting Pilot Corpus Speech</a>, while the video files will be published later as NIST Meeting Pilot Corpus Video.</p><p>For more information, documentation, and updates made after the release of this corpus, please consult the <a href="http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1" rel="nofollow">NIST project website</a> for the corpus.</p><h3>Data</h3> <p>The data for the NIST Automatic Meeting Recognition Project were collected at the NIST Meeting Data Collection Laboratory in Gaithersburg, MD, and include 19 meetings (about 15 hours of data) recorded between November 2001 and December 2003.</p><p>The full transcriptions included in this release were created using a "quick" transcription procedure. They contain roughly 151K words and 6K unique words. During collection of the pilot corpus, a variety of information about the subjects and the recording setup was recorded manually and stored in a relational database. A fully updated online version of the database is available from the <a href="http://www.nist.gov/speech/test_beds/mr_proj/meeting_corpus_1/recordings/index.html" rel="nofollow">NIST project website</a>.</p><h3>Updates</h3> <p>There are no updates available at this time.</p> <br> Portions © 2004 Trustees of the University of Pennsylvania


Provider:

Linguistic Data Consortium

Created:

2020-11-30
