Arabic Broadcast News Transcripts
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2006T20
Resource description:
<h3>Introduction</h3><br>
<p>Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of 10 hours of transcribed speech from Voice of America satellite radio news broadcasts in Arabic recorded by LDC between June 2000 and January 2001. The corresponding speech files are available in <a href="../../LDC2006S46">Arabic Broadcast News Speech (LDC2006S46)</a>.</p><br>
<p>This work was undertaken in the Networking Data Centers (NetDC) project (MLIS-5017, NSF IIS-9982201) in conjunction with the <a href="http://www.elda.org/en/">European Language Resources Association</a> (ELRA). ELRA transcribed 22.5 hours of Arabic broadcast data from Radio Orient (France) that is available in <a href="http://catalog.elra.info/product_info.php?products_id=13">NetDC Arabic BNSC (Broadcast News Speech Corpus) (ELRA-S0157)</a>. The goal of the NetDC project was to improve the infrastructure for language resources by designing and implementing new modes of cooperation between LDC and ELRA.</p><br>
<h3>Data</h3><br>
<p>The text is encoded entirely in ASCII, with Buckwalter transliteration used to render the Arabic content. Time alignment and structural markup are expressed as "pseudo-SGML" tags, presented one tag per line, with an opening angle bracket as the first character of the line.</p><br>
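<p>To read the transliterated text in Arabic script, the ASCII symbols can be mapped back to Unicode code points. The sketch below uses the commonly published Buckwalter table; it is an assumption that this corpus follows that table without local variations, and the names <code>BUCKWALTER_TO_ARABIC</code> and <code>buckwalter_to_arabic</code> are illustrative only. It should be applied to transcription tokens, not to tag lines.</p><br>

```python
# Commonly published Buckwalter table (ASCII symbol -> Arabic code point).
# Assumption: the corpus uses this standard table unchanged.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A", "F": "\u064B", "N": "\u064C", "K": "\u064D",
    "a": "\u064E", "u": "\u064F", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
}


def buckwalter_to_arabic(text: str) -> str:
    """Map each Buckwalter symbol to Arabic script.

    Characters not in the table (digits, most punctuation) pass through
    unchanged.
    """
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # "ktAb" is an illustrative word, not taken from the corpus.
    print(buckwalter_to_arabic("ktAb"))
```

<p>Because the mapping is one character to one character, it can be inverted for the reverse direction by swapping keys and values.</p><br>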
<p>The lines of transcription text (i.e., the speech and annotation content between the time-stamp tags) each begin with a single space character and contain exactly one token. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which annotates a non-speech condition or event (e.g., "music", "noise", "laugh").</p><br>
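<p>The line layout described above can be separated with a small classifier. This is a minimal sketch based only on that description: a line starting with "&lt;" is treated as a tag, and a line starting with a space is treated as a token, which may be a "(%...)" non-speech annotation. The names <code>Line</code> and <code>classify_lines</code> are hypothetical, and the sample input is invented for illustration, including its tag, which is not one of the corpus's actual tag names.</p><br>

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Line:
    kind: str   # "tag", "token", "event", or "other"
    value: str


def classify_lines(lines: Iterable[str]) -> Iterator[Line]:
    """Classify transcript lines by their first character."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<"):
            # Tag line: kept verbatim, since tag names and attributes
            # are not specified in the description.
            yield Line("tag", line)
        elif line.startswith(" "):
            token = line[1:]  # drop the single leading space
            if token.startswith("(%") and token.endswith(")"):
                # Non-speech annotation: keep only the enclosed word.
                yield Line("event", token[2:-1])
            else:
                yield Line("token", token)
        else:
            # Anything else does not match the described layout.
            yield Line("other", line)


if __name__ == "__main__":
    # Invented sample; the tag below is a placeholder, not a real corpus tag.
    sample = ["<hypothetical_tag>\n", " ktAb\n", " (%mwsyqY)\n", " .\n"]
    for item in classify_lines(sample):
        print(item.kind, item.value)
```

<p>Checking the first character before anything else keeps tokens that themselves contain "&lt;" or "&gt;" (both are Buckwalter symbols) from being mistaken for tags, since every token line starts with a space.</p><br>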
<h3>Updates</h3><br>
<p>None at this time.</p><br>
<h3>Samples</h3><br>
<p>Please view this <a href="desc/addenda/LDC2006T20.txt">transcript sample</a>.</p><br>
Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



