MATERIAL Kazakh-English Language Pack

Name: MATERIAL Kazakh-English Language Pack
Creator: Linguistic Data Consortium
Published: 2025-04-01 14:10:31
License: No description available

DataCite Commons (updated 2025-04-01, indexed 2025-04-16)

Download link:

http://catalog.ldc.upenn.edu/LDC2025S03

Description:

<h3>Introduction</h3>
<p>MATERIAL Kazakh-English Language Pack, Linguistic Data Consortium Catalog Number LDC2025S03, was developed by <a href="http://www.appen.com/">Appen</a> for the IARPA (Intelligence Advanced Research Projects Activity) <a href="https://www.iarpa.gov/index.php/research-programs/material">MATERIAL</a> (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations, and queries.</p>
<p>The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.</p>
<h3>Data</h3>
<p>The Kazakh speech in this release represents the language as spoken in the Northern and Southern dialect regions of Kazakhstan. Speakers were 18 years of age or older. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.</p>
<p>Transcripts cover approximately 17% of the speech data, and all of the transcribed material was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.</p>
<p>The Kazakh-English Language Pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms.</p>
<p>Speech data is presented mostly as two-channel wav or single-channel sphere files, both in 8 kHz A-law format. Some wav files are 48 kHz PCM. All text data is UTF-8 encoded.</p>
<h3>Updates</h3>
<p>Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at <a href="http://catalog.ldc.upenn.edu/LDC2025S03">LDC2025S03</a>.</p>
<h3>Content Copyright</h3>
<p>Portions © 2025 U.S. Government, © 2025 Trustees of the University of Pennsylvania</p>
<p>The U.S. Government acquired this data from Appen, which assigned the copyright in the data to the U.S. Government.</p>
<hr>
<p class="footer">Contact: <a href="mailto:ldc@ldc.upenn.edu"><b>ldc@ldc.upenn.edu</b></a><br> © 2021 <a href="http://www.ldc.upenn.edu"><b>Linguistic Data Consortium</b></a>, <a href="http://www.upenn.edu"><b>Trustees of the University of Pennsylvania</b></a>. All Rights Reserved.</p>
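
Because the release mixes 8 kHz A-law and 48 kHz PCM wav files, it can help to check each file's header before processing. The following is a minimal sketch, not part of the LDC release: it reads the standard RIFF/WAVE "fmt " chunk directly, since Python's built-in wave module is aimed at PCM data. The function names read_wav_format and make_header are hypothetical, and the demo uses a synthetic in-memory header rather than a real corpus file. Sphere files use a different header and are not handled here.

```python
import io
import struct

# Format tags defined by the RIFF/WAVE format (not specific to this corpus).
FORMAT_NAMES = {0x0001: "PCM", 0x0006: "A-law", 0x0007: "mu-law"}


def read_wav_format(stream):
    """Return the encoding, channel count, and sample rate of a RIFF/WAVE stream."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            if len(fmt) < 8:
                raise ValueError("truncated fmt chunk")
            tag, channels, rate = struct.unpack("<HHI", fmt[:8])
            return {
                "format": FORMAT_NAMES.get(tag, hex(tag)),
                "channels": channels,
                "sample_rate": rate,
            }
        # Skip other chunks; chunk data is padded to an even byte count.
        stream.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)


def make_header(tag, channels, rate, bits):
    """Build a synthetic WAVE header with an empty data chunk (demo only)."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", 0))
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Synthetic examples mirroring the two encodings named in the description.
alaw_info = read_wav_format(io.BytesIO(make_header(0x0006, 2, 8000, 8)))
pcm_info = read_wav_format(io.BytesIO(make_header(0x0001, 1, 48000, 16)))
print(alaw_info)
print(pcm_info)
```

For a file on disk, the same function can be called as read_wav_format(open(path, "rb")), where path is whatever location the audio was unpacked to.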

Provider:

Linguistic Data Consortium

Created:

2025-04-01
