AI-Assisted Syllabification and Tone Annotation Datasets for South-South Nigerian Languages
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/5kgjnhzsmd
Description:
This dataset contains lexical, syllabic, and tonal data for 25 South-South Nigerian languages. To produce the data, phonotactic rules and tone patterns were semantically interpreted by GPT-4 and used to build syllabification and tone-labelling engines. The process followed a human-in-the-loop framework to ensure that the rules were executed accurately across languages.
Contents:
1) Language_Profile_Wordlist.xlsx: Original language profile/template and wordlists.
2) Language_Profile_Wordlist_Updated.xlsx: Updated (corrected/edited) profile/template and wordlists.
3) wordlist.xlsx: Swadesh lexical list (108 English words) used as the primary input for documenting the 25 languages.
4) syllabified_output_result.zip: Syllabified outputs for the 25 languages, produced without the corrected/edited profile and wordlists and without human-in-the-loop validation labels.
5) syllabified_output_result_with_HILV.zip: Syllabified outputs for the 25 languages, produced without the corrected/edited profile and wordlists but annotated with human-in-the-loop validation labels.
6) syllabified_output_result_updated.zip: Updated syllabified outputs (with corrected/edited profile and wordlists) per language.
7) structure_transformed_updated.zip: Syllabified outputs further transformed into syllable structure representations (e.g., V-CV, N-CVV), stored in an additional column (N_CV Structure).
8) tone_labelling_output_result.zip: Wordlists with auto-generated tone labels without human-in-the-loop validation.
9) tone_labelling_output_result_updated.zip: Wordlists with auto-generated tone labels with human-in-the-loop validation (edited/corrected tone labels).
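The syllable structure transformation described in item 7 can be illustrated with a minimal sketch. This is not the engine used to build the dataset. It assumes that syllables are separated by hyphens, that vowels reduce to a, e, i, o, u once diacritics are stripped, and that a syllable made up only of a nasal letter is written N. The function names, the `digraphs` parameter, and the sample strings are hypothetical and are not taken from the dataset files.

```python
import unicodedata

# Assumed inventories; real languages in the dataset may need other sets.
VOWELS = set("aeiou")
NASALS = set("mnŋ")


def strip_marks(text):
    """Lower-case the text and drop combining marks (tone marks, underdots)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def syllable_shape(syllable, digraphs=()):
    """Map one syllable to a C/V/N shape string."""
    base = strip_marks(syllable)
    # A syllable made up only of nasal letters is treated as a syllabic nasal.
    if base and all(ch in NASALS for ch in base):
        return "N"
    shape = []
    i = 0
    while i < len(base):
        if base[i:i + 2] in digraphs:
            shape.append("C")  # a two-letter consonant counts as one C
            i += 2
        elif base[i] in VOWELS:
            shape.append("V")
            i += 1
        elif base[i].isalpha():
            shape.append("C")
            i += 1
        else:
            i += 1  # skip apostrophes, digits, and other symbols
    return "".join(shape)


def word_structure(syllabified, sep="-", digraphs=()):
    """Convert a hyphen-syllabified word to a structure string such as V-CV."""
    parts = [s for s in syllabified.split(sep) if s]
    return "-".join(syllable_shape(s, digraphs) for s in parts)


# Made-up strings, used only to show the mapping.
print(word_structure("a-ka"))                      # V-CV
print(word_structure("ń-daa"))                     # N-CVV
print(word_structure("á-kpà", digraphs=("kp",)))   # V-CV
```

Without a `digraphs` list, two-letter consonants such as kp are counted as two C segments, so the list should be set per language from its profile.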
Applications:
1) Phonological typology: Contents 3 through 8 provide syllabified and tone-annotated wordlists across 25 Nigerian languages, enabling comparative analysis of phonotactic constraints, syllable structures (e.g., CV, CVV, N-CVC), and tonal systems.
2) Cross-linguistic NLP development: By offering consistent syllabification and tone annotations, the dataset would facilitate the design and evaluation of NLP tools that can generalize across typologically diverse languages. Use cases include multilingual grapheme-to-phoneme converters, syllable-aware machine translation models, and pronunciation-aware text-to-speech systems.
3) Low-resource language modeling: The dataset could serve as a valuable resource for training and evaluating AI models in low-resource settings.
4) Tonal morphology and suprasegmental analysis: The tone-labelled column in the dataset allows for the study of tonal morphology, such as tone sandhi, tone disambiguation, and morphotonemic alternation.
Format:
All files are provided in .xlsx and .zip formats for ease of integration into language modeling pipelines.
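As a hedged illustration of how these formats might be read into a pipeline, the sketch below lists the .xlsx members of a .zip archive and loads each sheet into a table. It assumes that the archives contain .xlsx workbooks and that pandas and openpyxl are installed. The function names are hypothetical, and the internal layout of the archives is not documented here.

```python
import io
import zipfile


def list_workbooks(archive):
    """Return the sorted names of all .xlsx members in a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".xlsx")
        )


def load_workbooks(archive):
    """Read every .xlsx member into {member name: {sheet name: DataFrame}}."""
    import pandas as pd  # third-party; reading .xlsx also requires openpyxl

    tables = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xlsx"):
                # sheet_name=None asks pandas for every sheet in the workbook.
                tables[name] = pd.read_excel(
                    io.BytesIO(zf.read(name)), sheet_name=None
                )
    return tables
```

For example, `load_workbooks("syllabified_output_result_updated.zip")` would return one entry per workbook in that archive once the file has been downloaded. The column names in each sheet should be checked before use.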
Created:
2025-06-25



