
AI-Assisted Syllabification and Tone Annotation Datasets for South-South Nigerian Languages

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/5kgjnhzsmd
Description:
This dataset includes lexical, syllabic, and tonal data from 25 South-South Nigerian languages. To produce the data, phonotactic rules and tone patterns were semantically interpreted by GPT-4 to build syllabification and tone-labelling engines. The process followed a human-in-the-loop framework to ensure accurate rule execution across languages.

Contents:
1) Language_Profile_Wordlist.xlsx: Original language profile/template and wordlists.
2) Language_Profile_Wordlist_Updated.xlsx: Updated (corrected/edited) profile/template and wordlists.
3) wordlist.xlsx: Swadesh lexical list (108 English words) used as the primary input for documenting the 25 languages.
4) syllabified_output_result.zip: Syllabified outputs (without corrected/edited profile and wordlists), without human-in-the-loop validation labels, across 25 languages.
5) syllabified_output_result_with_HILV.zip: Syllabified outputs (without corrected/edited profile and wordlists), with human-in-the-loop validation labels, across 25 languages.
6) syllabified_output_result_updated.zip: Updated syllabified outputs (with corrected/edited profile and wordlists) per language.
7) structure_transformed_updated.zip: Syllabified outputs further transformed into syllable structure representations (e.g., V-CV, N-CVV), with a transformed syllable structure column (N_CV Structure).
8) tone_labelling_output_result.zip: Wordlists with auto-generated tone labels, without human-in-the-loop validation.
9) tone_labelling_output_result_updated.zip: Wordlists with auto-generated tone labels, with human-in-the-loop validation (edited/corrected tone labels).

Applications:
1) Phonological typology: Contents 3 to 8 provide syllabified and tone-annotated wordlists across 25 Nigerian languages, enabling comparative analysis of phonotactic constraints, syllable structures (e.g., CV, CVV, N-CVC), and tonal systems.
2) Cross-linguistic NLP development: By offering consistent syllabification and tone annotations, the dataset can facilitate the design and evaluation of NLP tools that generalize across typologically diverse languages. Use cases include multilingual grapheme-to-phoneme converters, syllable-aware machine translation models, and pronunciation-aware text-to-speech systems.
3) Low-resource language modeling: The dataset could serve as a resource for training and evaluating AI models in low-resource settings.
4) Tonal morphology and suprasegmental analysis: The tone-labelled columns in the dataset allow for the study of tonal morphology, such as tone sandhi, tone disambiguation, and morphotonemic alternation.

Format: All files are provided in .xlsx and .zip formats for ease of integration into language modeling pipelines.
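The syllable structure transformation described for structure_transformed_updated.zip (e.g., V-CV, N-CVV) can be illustrated with a minimal Python sketch. This is not the dataset's GPT-4-based engine; it is a simplified approximation under stated assumptions: syllables are separated by hyphens, the vowel inventory and the digraphs `kp` and `gb` are hypothetical placeholders, a syllable consisting only of `m` or `n` is treated as a syllabic nasal, and tone diacritics are ignored. The example words are invented strings, not entries from the dataset.

```python
import unicodedata

# Assumptions for illustration only; each language's real inventory differs.
VOWELS = set("aeiouɛɔ")
DIGRAPHS = ("kp", "gb")  # treated as single consonants


def syllable_shape(syllable: str) -> str:
    """Map one orthographic syllable to a C/V/N skeleton."""
    # Decompose accented letters, then drop combining marks (tone, dot-below).
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: a bare m or n forms a syllabic nasal.
    if base in {"m", "n"}:
        return "N"
    shape = []
    i = 0
    while i < len(base):
        if base[i:i + 2] in DIGRAPHS:
            shape.append("C")
            i += 2
        elif base[i] in VOWELS:
            shape.append("V")
            i += 1
        elif base[i].isalpha():
            shape.append("C")
            i += 1
        else:
            i += 1  # skip punctuation or other symbols
    return "".join(shape)


def word_structure(syllabified: str, sep: str = "-") -> str:
    """Convert a hyphen-syllabified word into its structure string."""
    return sep.join(syllable_shape(s) for s in syllabified.split(sep) if s)
```

Under these assumptions, `word_structure("á-kpà")` yields `V-CV` and `word_structure("ń-too")` yields `N-CVV`. A real pipeline would need a per-language profile of vowels, digraphs, and syllabic nasals, which is the role the language profile files in the dataset appear to play.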
Created:
2025-06-25