MATERIAL Farsi-English Language Pack
Source: DataCite Commons. Updated: 2025-06-03. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2024S13
Resource description:
<h3>Introduction</h3>
<p>MATERIAL Farsi-English Language Pack was developed by <a href="http://www.appen.com/">Appen</a> for the IARPA (Intelligence Advanced Research Projects Activity) <a href="https://www.iarpa.gov/index.php/research-programs/material">MATERIAL</a> (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations and queries.</p>
<p>The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.</p>
<h3>Data</h3>
<p>The Farsi speech in this release represents that spoken in the Greater Tehran, Central/Southwest, Northeast, and Northwest dialect regions of Iran, as well as a standard formal dialect in use throughout the country. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.</p>
<p>Transcripts cover approximately a third of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.</p>
<p>The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to the query search terms.</p>
<p>Speech data is presented as either two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. All text data is UTF-8 encoded.</p>
Provider:
Linguistic Data Consortium
Created:
2024-11-19



