Replication Data for Igbo Natural Language Processing Tasks II
Source: DataCite Commons. Updated: 2025-05-12. Indexed: 2025-05-17.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/YB9FWK
Description:
The Igbo synchronised corpus (IgboSynCorp) is an annotated corpus of spoken Igbo created by a team of linguists and NLP experts at the University of Ibadan and Afe Babalola University, Nigeria. The project was designed to create an open-access dataset, with both labelled and unlabelled data, for Natural Language Processing tasks in the Igbo language. The dataset is intended to enable more robust and more equitable application of machine learning tools of high social value in Igbo.
The dataset consists of ELAN text files and WAV files of Igbo speech. There are two categories of ELAN files: Gold files (90 minutes) and Non-Gold files (188 minutes). The Gold files (19,722 words, or 2,761 sentences) were transcribed phonetically and orthographically, translated into English, glossed, and PoS-tagged using the Universal Dependencies PoS tags. The Non-Gold files were only transcribed orthographically and translated into English. There are 110 recordings of spoken Igbo oral narratives (.wav files), amounting to 38.8075 hours (2,328.45 minutes). The metadata is compiled in Excel sheets: IgboSynCorp Metadata I contains demographic information about the language consultants, while IgboSynCorp Metadata II outlines the domains of speech represented in each individual WAV file (oral narrative). There are two lexicon files, with about 2,300 words altogether, which originated from the glossing and part-of-speech tagging.
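Because the annotations are distributed as ELAN files, a reader may want to pull the tiers into plain Python structures. The following is a minimal sketch, assuming the files follow the standard ELAN `.eaf` XML layout (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `REF_ANNOTATION`). The function name `read_eaf_tiers` is hypothetical, and the tier names actually used in IgboSynCorp are not stated in this record, so the sketch simply returns whatever tiers it finds.

```python
import xml.etree.ElementTree as ET


def read_eaf_tiers(source):
    """Return {tier_id: [(start_ms, end_ms, text), ...]} for one ELAN .eaf file.

    `source` is a file path or a binary file-like object.
    Assumes the standard ELAN EAF element and attribute names.
    """
    root = ET.parse(source).getroot()

    # Map each time-slot id to its offset in milliseconds.
    # TIME_VALUE is optional in EAF, so unaligned slots become None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    # First pass: record the time span of every time-aligned annotation.
    spans = {}
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        spans[ann.get("ANNOTATION_ID")] = (
            slots.get(ann.get("TIME_SLOT_REF1")),
            slots.get(ann.get("TIME_SLOT_REF2")),
        )

    # Second pass: collect annotations tier by tier.
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start, end = spans[ann.get("ANNOTATION_ID")]
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        for ann in tier.iter("REF_ANNOTATION"):
            # Dependent annotations (e.g. a translation) inherit the span of
            # their parent; only one level of reference is resolved here.
            start, end = spans.get(ann.get("ANNOTATION_REF"), (None, None))
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers
```

Calling `read_eaf_tiers("example.eaf")` (a placeholder file name) returns a dictionary keyed by tier id. Listing its keys is a quick way to see which transcription, translation, gloss, or PoS tiers a given Gold or Non-Gold file contains.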
The project was funded by Lacuna Fund (https://lacunafund.org) of the Meridian Institute, 105 Village Place, Dillon, Colorado 80435, United States of America.
Provider:
Harvard Dataverse
Created:
2022-06-21



