MATERIAL Somali-English Language Pack

Name: MATERIAL Somali-English Language Pack
Creator: Linguistic Data Consortium
Published: 2025-06-03 15:17:36
License: No description available

Source: DataCite Commons (updated 2025-06-03, indexed 2025-04-16)

Download link:

https://catalog.ldc.upenn.edu/LDC2024S10

Resource description:

<h3>Introduction</h3>
<p>MATERIAL Somali-English Language Pack was developed by <a href="http://www.appen.com/">Appen</a> for the IARPA (Intelligence Advanced Research Projects Activity) <a href="https://www.iarpa.gov/index.php/research-programs/material">MATERIAL</a> (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations and queries.</p>
<p>The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.</p>
<h3>Data</h3>
<p>The Somali speech in this release represents that spoken in the Northern and Benaadir dialect regions of Somalia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.</p>
<p>Transcripts cover approximately 10% of the speech data, and approximately 4% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.</p>
<p>MATERIAL Somali-English Language Pack also includes domain annotations, English queries and their relevance annotations. Annotators marked transcripts by domain (e.g., lifestyle, business-and-commerce, sports, education, and so on), by query type (simple, conceptual, hybrid) and by their relevance to query search terms.</p>
<p>Speech data is presented either as two-channel WAV or single-channel SPHERE files, predominantly in 8 kHz A-law format, with some WAV files at a sample rate of 48 kHz. All text data is UTF-8 encoded.</p>
<h3>Samples</h3>
<p>Please view the following samples:</p>
<ul> <li><a href="desc/addenda/LDC2024S10.wav">Audio Sample (WAV)</a></li> <li><a href="desc/addenda/LDC2024S10.transcription.txt">Transcript Sample (TXT)</a></li> <li><a href="desc/addenda/LDC2024S10.translation.eng.txt">Translation Sample (TXT)</a></li> </ul>
<h3>Updates</h3>
<p>None at this time.</p>
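
Because the description mentions a mix of 8 kHz A-law and 48 kHz WAV audio, it can help to check each file's header before processing. The following is a minimal sketch, not part of the release: it reads the `fmt ` chunk of a RIFF/WAVE file directly, since Python's standard `wave` module accepts only uncompressed PCM and will reject A-law files. The function name `read_wav_format` and the `WAVE_FORMATS` table are illustrative assumptions, and SPHERE files are not handled here.

```python
import io
import struct

# Common WAVE format tags (1 = PCM, 6 = A-law, 7 = mu-law).
WAVE_FORMATS = {1: "PCM", 6: "A-law", 7: "mu-law"}


def read_wav_format(stream):
    """Return (format_name, channels, sample_rate) from a RIFF/WAVE byte stream.

    Hypothetical helper for illustration; `stream` must be seekable and
    opened in binary mode.
    """
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            body = stream.read(chunk_size)
            if len(body) < 8:
                raise ValueError("truncated fmt chunk")
            # First 8 bytes: format tag, channel count, sample rate.
            tag, channels, rate = struct.unpack("<HHI", body[:8])
            return WAVE_FORMATS.get(tag, f"tag {tag}"), channels, rate
        # Skip any other chunk; chunk bodies are padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)


def describe(path):
    """Open a file on disk and report its WAVE format fields."""
    with open(path, "rb") as f:
        return read_wav_format(f)
```

Calling `describe` on a local copy of a WAV file returns a tuple such as format name, channel count, and sample rate, which is enough to separate the A-law telephone recordings from the 48 kHz files before choosing a decoder or resampling step.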

Provider:

Linguistic Data Consortium

Created:

2024-09-16
