
MEYNG/sango-vocabulary

Source: Hugging Face, updated 2026-04-09, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/MEYNG/sango-vocabulary
Resource description:
---
language:
- sg
- fr
- en
license: cc-by-sa-4.0
task_categories:
- translation
- text-classification
- token-classification
task_ids:
- language-modeling
- multi-class-classification
pretty_name: Sango Vocabulary Dataset
size_categories:
- n<1K
tags:
- sango
- central-african-republic
- low-resource-languages
- african-languages
- language-preservation
- trilingual
- vocabulary
- dictionary
dataset_info:
  features:
  - name: sango
    dtype: string
  - name: french
    dtype: string
  - name: english
    dtype: string
  - name: category
    dtype: string
  - name: difficulty
    dtype: string
  - name: pronunciation
    dtype: string
  - name: example_sango
    dtype: string
  - name: example_french
    dtype: string
  - name: example_english
    dtype: string
  splits:
  - name: train
    num_examples: 431
  config_name: vocabulary
configs:
- config_name: vocabulary
  data_files:
  - split: train
    path: vocabulary.jsonl
- config_name: training_pairs
  data_files:
  - split: train
    path: training_pairs.jsonl
---

# Sango Vocabulary Dataset

## Dataset Description

The first structured digital vocabulary dataset for **Sango** (ISO 639-1: `sg`, ISO 639-3: `sag`), the national and most widely spoken language of the **Central African Republic**.

Sango is a creole language with over 5 million speakers, yet it remains severely underrepresented in NLP research and digital resources. This dataset provides trilingual vocabulary entries (Sango-French-English) with pronunciation guides and semantic categorization, along with Sango-French translation pairs extracted from grammar and dictionary references.
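As a rough illustration of the schema declared in the metadata above, the following sketch parses one JSON Lines record and checks it against the nine string features. The helper name `parse_vocabulary_line` is hypothetical, and the handling of absent example sentences (empty string, null, or missing key) is an assumption, since the card does not say how they are stored. The sample record reuses the example entry shown on this card; only the Python standard library is used.

```python
import json

# Field names copied from `dataset_info.features` in the card metadata.
# The card reports 100% coverage for the first six and <1% for the examples.
REQUIRED_FIELDS = ("sango", "french", "english",
                   "category", "difficulty", "pronunciation")
EXAMPLE_FIELDS = ("example_sango", "example_french", "example_english")


def parse_vocabulary_line(line: str) -> dict:
    """Parse one JSONL line and check it against the declared features.

    Hypothetical helper, not part of the dataset's own tooling.
    Raises ValueError when a required field is missing, empty, or not a string.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must hold a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty required field: {field}")
    for field in EXAMPLE_FIELDS:
        # Assumption: entries without an example store "" or null, or omit
        # the key entirely; all three cases are normalised to "".
        value = record.get(field)
        if value is not None and not isinstance(value, str):
            raise ValueError(f"field {field} must be a string if present")
        record[field] = value or ""
    return record


# Sample line built from the example entry shown on this card.
sample_line = json.dumps({
    "sango": "Bara âla",
    "french": "Bonjour",
    "english": "Hello",
    "category": "greetings",
    "difficulty": "beginner",
    "pronunciation": "BAH-rah AH-lah",
    "example_sango": "Bara âla, tongana nye?",
    "example_french": "Bonjour, comment allez-vous?",
    "example_english": "Hello, how are you?",
}, ensure_ascii=False)

entry = parse_vocabulary_line(sample_line)
print(entry["sango"], "->", entry["english"])
```

To apply the same check to the real file, one would iterate over the lines of `vocabulary.jsonl` after downloading it and call the helper on each non-empty line.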
### Supported Tasks

- **Machine Translation**: Sango-French and Sango-English translation
- **Language Modeling**: Pre-training or fine-tuning language models for Sango
- **Cross-lingual Transfer**: Leveraging French/English representations for Sango NLP
- **Language Documentation**: Digital preservation of Sango vocabulary and usage patterns
- **Educational Technology**: Building language learning applications for Sango

### Languages

| Language | ISO 639-1 | ISO 639-3 | Role                      |
| -------- | --------- | --------- | ------------------------- |
| Sango    | sg        | sag       | Primary target language   |
| French   | fr        | fra       | Primary bridge language   |
| English  | en        | eng       | Secondary bridge language |

## Dataset Structure

### Vocabulary (`vocabulary.jsonl`)

431 trilingual vocabulary entries, each containing:

| Field             | Type   | Description                   | Coverage |
| ----------------- | ------ | ----------------------------- | -------- |
| `sango`           | string | Sango word or phrase          | 100%     |
| `french`          | string | French translation            | 100%     |
| `english`         | string | English translation           | 100%     |
| `category`        | string | Semantic/grammatical category | 100%     |
| `difficulty`      | string | Learning difficulty level     | 100%     |
| `pronunciation`   | string | Phonetic pronunciation guide  | 100%     |
| `example_sango`   | string | Example sentence in Sango     | <1%      |
| `example_french`  | string | Example sentence in French    | <1%      |
| `example_english` | string | Example sentence in English   | <1%      |

**Example entry:**

```json
{
  "sango": "Bara âla",
  "french": "Bonjour",
  "english": "Hello",
  "category": "greetings",
  "difficulty": "beginner",
  "pronunciation": "BAH-rah AH-lah",
  "example_sango": "Bara âla, tongana nye?",
  "example_french": "Bonjour, comment allez-vous?",
  "example_english": "Hello, how are you?"
}
```

### Training Pairs (`training_pairs.jsonl`)

360 Sango-French translation pairs extracted from grammar and dictionary reference materials.

| Field         | Type   | Description                         |
| ------------- | ------ | ----------------------------------- |
| `source_lang` | string | Source language code (`sg` or `fr`) |
| `target_lang` | string | Target language code (`sg` or `fr`) |
| `source`      | string | Source text                         |
| `target`      | string | Target text                         |

**Note on quality**: These pairs were extracted from a Sango grammar reference book. While many entries contain genuine Sango-French translations and vocabulary definitions, some are sentence fragments or contextual excerpts from the source material. Users should apply filtering for downstream tasks that require clean parallel sentences.

**Example entry:**

```json
{
  "source_lang": "sg",
  "target_lang": "fr",
  "source": "Mbï yê Bêafrîka mîngi",
  "target": "j'aime beaucoup la Centrafrique"
}
```

### Category Distribution

The vocabulary spans 25 semantic categories:

| Category  | Count | Category    | Count |
| --------- | ----- | ----------- | ----- |
| nature    | 30    | professions | 30    |
| clothing  | 25    | commerce    | 25    |
| education | 25    | health      | 25    |
| home      | 25    | adjectives  | 20    |
| numbers   | 20    | technology  | 20    |
| transport | 20    | verbs       | 18    |
| animals   | 15    | emotions    | 15    |
| greetings | 15    | body        | 10    |
| colors    | 10    | essential   | 10    |
| actions   | 10    | food        | 12    |
| family    | 12    | places      | 12    |
| time      | 10    | weather     | 10    |
| questions | 7     |             |       |

### Difficulty Distribution

| Level        | Count | Percentage |
| ------------ | ----- | ---------- |
| Beginner     | 210   | 48.7%      |
| Intermediate | 200   | 46.4%      |
| Advanced     | 21    | 4.9%       |

## Dataset Creation

### Source Data

- **Vocabulary entries**: Curated by MEYNG from Sango language learning resources including japprendslesango.com and verified Sango vocabulary references
- **Training pairs**: Extracted from a comprehensive Sango grammar and dictionary reference work
- **Verification**: Entries reviewed against known Sango linguistic sources

### Annotation Process

Vocabulary entries were structured and categorized by native speakers and linguists familiar with Sango. Pronunciation guides use an English-approximation phonetic system designed for learners.

## Considerations for Using the Data

### Known Limitations

1. **Dataset size**: At 431 vocabulary entries and 360 translation pairs, this dataset is small by modern NLP standards. It is best suited as a seed dataset, evaluation set, or for few-shot learning scenarios.
2. **Training pair quality**: The translation pairs contain some fragmented text from the source material. Filtering is recommended for tasks requiring clean parallel data.
3. **Example sentences**: Only 1 vocabulary entry includes example sentences. Future versions aim to add examples for all entries.
4. **Dialect coverage**: Sango has regional variations. This dataset primarily represents the standardized Sango used in Bangui (the capital).
5. **Pronunciation guides**: The phonetic guides approximate Sango sounds using English phonetics. They are intended for learners, not for phonological research.

### Recommendations

- Use as a seed dataset to bootstrap Sango NLP systems
- Combine with the SangoAI API (643+ words available at `https://sangoai.sbs`) for a larger vocabulary
- Apply quality filtering on `training_pairs.jsonl` before fine-tuning
- Consider data augmentation techniques for downstream tasks

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{meyng_sango_vocabulary_2026,
  title={Sango Vocabulary Dataset: A Trilingual Lexical Resource for an Underrepresented African Language},
  author={MEYNG},
  year={2026},
  url={https://huggingface.co/datasets/meyng/sango-vocabulary},
  license={CC-BY-SA-4.0},
  language={sg, fr, en}
}
```

## License

This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA-4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
You are free to:

- **Share** -- copy and redistribute the material in any medium or format
- **Adapt** -- remix, transform, and build upon the material for any purpose, including commercial

Under the following terms:

- **Attribution** -- You must give appropriate credit to MEYNG, provide a link to the license, and indicate if changes were made
- **ShareAlike** -- If you remix, transform, or build upon the material, you must distribute your contributions under the same license

## Acknowledgments

This dataset was created by **[MEYNG](https://meyng.com)** as part of the **[SangoAI](https://sangoai.sbs)** project, an AI-powered language platform dedicated to the preservation and digitization of Sango.

We acknowledge:

- The people of the **Central African Republic**, whose language and culture this dataset aims to preserve and promote
- The Sango-speaking community worldwide for their contributions to language documentation
- The creators of **japprendslesango.com** and other Sango learning resources that informed this work
- The open-source NLP community working on low-resource African languages

## Contact

- **Organization**: MEYNG
- **Website**: [meyng.com](https://meyng.com)
- **Platform**: [sangoai.sbs](https://sangoai.sbs)
- **Email**: contact@meyng.com
- **GitHub**: [github.com/meyng-hub/sangoai](https://github.com/meyng-hub/sangoai)
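Returning to the recommendation to filter `training_pairs.jsonl` before fine-tuning: the card does not prescribe a filtering method, so the following is only a minimal sketch of one possible heuristic. The function name `keep_pair` and both thresholds (`min_words`, `max_length_ratio`) are assumptions chosen for illustration and would need tuning against the real data. The first sample pair is the example entry from this card; the second is an invented fragment, not taken from the dataset.

```python
import json

ALLOWED_LANGS = {"sg", "fr"}  # language codes documented for training_pairs.jsonl


def keep_pair(pair: dict, min_words: int = 2, max_length_ratio: float = 3.0) -> bool:
    """Heuristic sentence-pair filter (hypothetical; thresholds are arbitrary).

    Keeps a pair only if both language codes are valid and distinct, both
    sides have at least `min_words` whitespace-separated tokens, and the
    character lengths differ by at most a factor of `max_length_ratio`.
    """
    src_lang = pair.get("source_lang")
    tgt_lang = pair.get("target_lang")
    if src_lang not in ALLOWED_LANGS or tgt_lang not in ALLOWED_LANGS:
        return False
    if src_lang == tgt_lang:
        return False
    source = (pair.get("source") or "").strip()
    target = (pair.get("target") or "").strip()
    if not source or not target:
        return False
    if len(source.split()) < min_words or len(target.split()) < min_words:
        return False
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_length_ratio


sample_lines = [
    # Example entry from this card.
    json.dumps({"source_lang": "sg", "target_lang": "fr",
                "source": "Mbï yê Bêafrîka mîngi",
                "target": "j'aime beaucoup la Centrafrique"}, ensure_ascii=False),
    # Invented fragment for illustration only, not taken from the dataset.
    json.dumps({"source_lang": "sg", "target_lang": "fr",
                "source": "Mbï",
                "target": "voir aussi le chapitre sur les pronoms"}, ensure_ascii=False),
]

kept = [pair for pair in map(json.loads, sample_lines) if keep_pair(pair)]
print(f"kept {len(kept)} of {len(sample_lines)} pairs")
```

Note that a word-count threshold like this also discards single-word vocabulary definitions, which the quality note describes as genuine content. Setting `min_words=1` keeps them and is the better choice when the goal is lexicon building instead of sentence-level translation.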
Provider:
MEYNG