Turkish Broadcast News Speech and Transcripts

DataCite Commons · Updated 2021-07-01 · Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2012S06
Description:
<h3>Introduction</h3><br> <p>Turkish Broadcast News Speech and Transcripts was developed by <a href="http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx" rel="nofollow">Bo&#287;azi&ccedil;i University</a>, Istanbul, Turkey, and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. It is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.</p><br> <p>The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.</p><br> <p>Speech recognition and retrieval experiments using the larger corpus are reported in the following journal article: Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak, and Murat Saraclar, Turkish Broadcast News Transcription and Retrieval, IEEE Transactions on Audio, Speech, and Language Processing, 17(5):874-883, July 2009.</p><br> <p>For more information, please visit <a href="http://busim.ee.boun.edu.tr/~speech" rel="nofollow">http://busim.ee.boun.edu.tr/~speech</a> or contact the principal investigator, Murat Sara&ccedil;lar.</p><br> <h3>Data</h3><br> <p>The data was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. Segmentation took place in two steps: an initial automatic segmentation, followed by manual correction and annotation, which included information such as background conditions and speaker boundaries.</p><br> <p>The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines.
An English version of the adapted guidelines is provided with the data <a href="docs/LDC2012S06/RapidTrans.pdf" rel="nofollow">here</a>. The manual segmentations and transcripts were created by native Turkish speakers at Bo&#287;azi&ccedil;i University using <a href="http://trans.sourceforge.net/en/presentation.php" rel="nofollow">Transcriber</a>. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.</p><br> <h3>Samples</h3><br> <p>Please follow the links below for samples:</p><br> <ul><br> <li><a href="desc/addenda/LDC2012S06.wav" rel="nofollow">Audio</a></li><br> <li><a href="desc/addenda/LDC2012S06.jpg" rel="nofollow">Transcript</a></li><br> </ul><br> <h3>Sponsorship</h3><br> <p>Funding for this corpus collection effort came from TUBITAK Project 105E102 and Bogazici University Research Fund Project 05HA202.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> Portions © 2012 Murat Saraçlar, Trustees of the University of Pennsylvania
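Because the transcriptions are distributed in ISO-8859-9 (Latin5) rather than UTF-8, they need to be decoded explicitly before Turkish letters such as ğ, ı, and ş display correctly. The following is a minimal Python sketch of that step, not part of the corpus tooling. The function names and any file paths passed to them are hypothetical, and the layout of the transcript files themselves is not assumed.

```python
from pathlib import Path

# Python's standard codec name for the Latin5 character set used by the transcripts.
LATIN5 = "iso-8859-9"


def decode_latin5(raw: bytes) -> str:
    """Decode raw transcript bytes from ISO-8859-9 into a Python string."""
    return raw.decode(LATIN5)


def convert_to_utf8(src: Path, dst: Path) -> str:
    """Read a Latin5-encoded file and write a UTF-8 copy (hypothetical helper).

    The paths are supplied by the caller; nothing here depends on how the
    corpus names or organizes its transcript files.
    """
    text = decode_latin5(src.read_bytes())
    dst.write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    # 0xF0 is "ğ" in ISO-8859-9 but "ð" in ISO-8859-1, so using the wrong
    # codec silently corrupts Turkish text instead of raising an error.
    sample = b"Bo\xf0azi\xe7i"
    print(decode_latin5(sample))
    print(sample.decode("iso-8859-1"))
```

Decoding with the wrong single-byte codec does not fail, so mistakes of this kind usually surface only as odd characters in the output. Stating the codec explicitly avoids relying on the platform default.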
Provider:
Linguistic Data Consortium
Created:
2020-11-30