MATERIAL Kazakh-English Language Pack
Source: DataCite Commons. Updated 2025-04-01; indexed 2025-04-16.
Download link:
http://catalog.ldc.upenn.edu/LDC2025S03
Resource description:
<h3>Introduction</h3>
<p>MATERIAL Kazakh-English Language Pack, Linguistic Data Consortium
Catalog Number LDC2025S03, was developed by <a
href="http://www.appen.com/">Appen</a> for the IARPA (Intelligence
Advanced Research Projects Activity)
<a href= "https://www.iarpa.gov/index.php/research-programs/material">MATERIAL</a>
(Machine Translation for English Retrieval of Information in Any Language)
program. It contains approximately 57 hours of Kazakh conversational
telephone speech, transcripts, English translations, annotations and queries.</p>
<p>The MATERIAL program focused on underserved languages, with the ultimate
goal of building cross-language information retrieval systems that find speech
and text content using English search queries.</p>
<h3>Data</h3>
<p>The Kazakh speech in this release represents that spoken in the Northern
and Southern dialect regions of Kazakhstan. Speakers were 18 years of age or older. Calls were made
using different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle.</p>
<p>Transcripts cover approximately 17% of the speech data, and all transcripts
were translated into English. Further information about transcription and
translation methodologies is contained in the documentation accompanying this
release.</p>
<p>The Kazakh-English Language Pack also includes English queries and their
relevance annotations. Annotators marked transcripts by query type (simple,
conceptual, hybrid) and by their relevance to query search terms.</p>
<p>Speech data is presented mostly as two-channel WAV or single-channel
SPHERE files, both in 8 kHz A-law format. Some WAV files are 48 kHz PCM.
All text data is UTF-8 encoded.</p>
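<p>Because the release mixes 8 kHz A-law and 48 kHz PCM WAV files, it can help
to check each file's header before processing. The following is a minimal
sketch, not part of the LDC release: it reads the standard RIFF/WAVE
<code>fmt</code> chunk with Python's <code>struct</code> module. The function
names <code>describe_wav</code> and <code>make_header</code> are hypothetical,
the headers are synthetic stand-ins rather than corpus files, and the channel
count used for the 48 kHz example is an assumption. SPHERE files use a
different header and are not handled here.</p>

```python
import io
import struct

# Format tags defined by the RIFF/WAVE specification.
FORMAT_NAMES = {1: "PCM", 6: "A-law", 7: "mu-law"}


def describe_wav(stream):
    """Return (format_name, channels, sample_rate) from a WAV 'fmt ' chunk."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no 'fmt ' chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            tag, channels, rate = struct.unpack("<HHI", fmt[:8])
            return FORMAT_NAMES.get(tag, f"unknown ({tag})"), channels, rate
        # Skip any other chunk; chunks are padded to an even byte count.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def make_header(tag, channels, rate, bits):
    """Build a synthetic WAV header with an empty data chunk (demo only)."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", 0))
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Synthetic examples mirroring the two WAV variants described above.
print(describe_wav(io.BytesIO(make_header(6, 2, 8000, 8))))
print(describe_wav(io.BytesIO(make_header(1, 1, 48000, 16))))
```

<p>For a real file, the same function can be applied to a handle opened in
binary mode, for example <code>open(path, "rb")</code>. Python's standard
<code>wave</code> module reads only PCM data, which is why this sketch parses
the header directly.</p>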
<h3>Updates</h3>
<p>
Additional information, updates, and bug fixes may be available in the LDC
catalog entry for this corpus at <a
href="http://catalog.ldc.upenn.edu/LDC2025S03">LDC2025S03</a>.
</p>
<h3>Content Copyright</h3>
<p>Portions © 2025 U.S. Government, © 2025 Trustees of the University of Pennsylvania</p>
<p>The U.S. Government acquired this data from Appen, which assigned the copyright in the data to the U.S. Government.</p>
<hr>
<p class="footer">
Contact: <a href="mailto:ldc@ldc.upenn.edu">
<b>ldc@ldc.upenn.edu</b></a><br> © 2021 <A
HREF="http://www.ldc.upenn.edu">
<b>Linguistic Data Consortium</b></a>,
<a href="http://www.upenn.edu">
<b>Trustees of the University of Pennsylvania</b></a>. All Rights Reserved.
</p>
Provider:
Linguistic Data Consortium
Created:
2025-04-01



