Korean Broadcast News Transcripts

Name: Korean Broadcast News Transcripts
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:19:02
License: No description available

DataCite Commons, updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2006T14

Description:

<h3>Introduction</h3>
<p>Korean Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts for 13 hours of Voice of America satellite radio news broadcasts in Korean. The LDC recorded the broadcasts at transmission time during a two-week period between January 21, 2000 and February 7, 2000.</p>
<h3>Data</h3>
<p>This corpus contains 18 broadcast transcription text files. Ten of the broadcasts are 30 minutes long, and the other eight are 60 minutes long. The file names indicate the date (YYYYMMDD) and the begin and end times (HHMM EST) of the original transmission.</p>
<p>The character encoding is Unicode UTF-8, and the file contents are structured using SGML. The markup strategy was defined by the National Institute of Standards and Technology (NIST) specifically for transcripts of broadcast news speech. The "docs" directory provides a working DTD file, a complete description (as a PostScript file) of the document structure, tags and attributes, and a simple text file listing the 18 data file names in the corpus.</p>
<p>The transcripts have been manually time-aligned at the phrasal level and annotated to identify boundaries between news stories and speaker turns; speaker names and gender are given where identifiable. These annotations are all provided via the SGML tags and their attributes.</p>
<p>A strong effort has been made to identify all unique speakers across the transcripts. However, there may be cases where an individual speaker was not recognized and was instead given a unique, anonymous identifier.</p>
<p>The corresponding speech files for these transcripts are available in <a href="../../../LDC2006S42">Korean Broadcast News Speech (LDC2006S42)</a>.</p>
<h3>Samples</h3>
<p>Please view this <a href="desc/addenda/LDC2006T14.sgm">transcript sample (SGML)</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
Portions © 2000, 2006 Trustees of the University of Pennsylvania
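
The file-name convention described under Data (a YYYYMMDD date followed by HHMM begin and end times) lends itself to a quick consistency check against the stated 13 hours. The following is a minimal sketch, not part of the corpus documentation: the underscore-separated layout `YYYYMMDD_HHMM_HHMM` and the helper names `broadcast_span` and `total_hours` are assumptions made for illustration, since the description does not spell out the exact separators or file suffix.

```python
import re
from datetime import datetime, timedelta

# Hypothetical file-name layout: YYYYMMDD_HHMM_HHMM plus any suffix.
# The corpus description only says that names carry the date and the
# begin/end times (EST); the separators used here are an assumption.
NAME_PATTERN = re.compile(r"(\d{8})_(\d{4})_(\d{4})")


def broadcast_span(file_name):
    """Return (start, end) datetimes parsed from a transcript file name."""
    match = NAME_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"unrecognized file name: {file_name}")
    date, begin, end = match.groups()
    start = datetime.strptime(date + begin, "%Y%m%d%H%M")
    finish = datetime.strptime(date + end, "%Y%m%d%H%M")
    if finish <= start:
        # A broadcast that crosses midnight ends on the following day.
        finish += timedelta(days=1)
    return start, finish


def total_hours(file_names):
    """Sum broadcast durations, in hours, over a list of file names."""
    seconds = sum(
        (end - start).total_seconds()
        for start, end in map(broadcast_span, file_names)
    )
    return seconds / 3600
```

Applied to the names listed in the "docs" directory, `total_hours` should come out at 13 if the names follow the assumed layout, matching ten 30-minute and eight 60-minute broadcasts (10 × 0.5 + 8 × 1 = 13 hours).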


Provider:

Linguistic Data Consortium

Created:

2020-11-30
