MEYNG/sango-vocabulary
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/MEYNG/sango-vocabulary
---
language:
- sg
- fr
- en
license: cc-by-sa-4.0
task_categories:
- translation
- text-classification
- token-classification
task_ids:
- language-modeling
- multi-class-classification
pretty_name: Sango Vocabulary Dataset
size_categories:
- n<1K
tags:
- sango
- central-african-republic
- low-resource-languages
- african-languages
- language-preservation
- trilingual
- vocabulary
- dictionary
dataset_info:
  features:
  - name: sango
    dtype: string
  - name: french
    dtype: string
  - name: english
    dtype: string
  - name: category
    dtype: string
  - name: difficulty
    dtype: string
  - name: pronunciation
    dtype: string
  - name: example_sango
    dtype: string
  - name: example_french
    dtype: string
  - name: example_english
    dtype: string
  splits:
  - name: train
    num_examples: 431
  config_name: vocabulary
configs:
- config_name: vocabulary
  data_files:
  - split: train
    path: vocabulary.jsonl
- config_name: training_pairs
  data_files:
  - split: train
    path: training_pairs.jsonl
---
# Sango Vocabulary Dataset
## Dataset Description
This is the first structured digital vocabulary dataset for **Sango** (ISO 639-1: `sg`, ISO 639-3: `sag`), the national language and most widely spoken language of the **Central African Republic**. Sango is a creole language with over 5 million speakers, yet it remains severely underrepresented in NLP research and digital resources.
This dataset provides trilingual vocabulary entries (Sango-French-English) with pronunciation guides and semantic categorization, along with Sango-French translation pairs extracted from grammar and dictionary references.
### Supported Tasks
- **Machine Translation**: Sango-French and Sango-English translation
- **Language Modeling**: Pre-training or fine-tuning language models for Sango
- **Cross-lingual Transfer**: Leveraging French/English representations for Sango NLP
- **Language Documentation**: Digital preservation of Sango vocabulary and usage patterns
- **Educational Technology**: Building language learning applications for Sango
### Languages
| Language | ISO 639-1 | ISO 639-3 | Role |
| -------- | --------- | --------- | ------------------------- |
| Sango | sg | sag | Primary target language |
| French | fr | fra | Primary bridge language |
| English | en | eng | Secondary bridge language |
## Dataset Structure
### Vocabulary (`vocabulary.jsonl`)
431 trilingual vocabulary entries, each containing:
| Field | Type | Description | Coverage |
| ----------------- | ------ | ----------------------------- | -------- |
| `sango` | string | Sango word or phrase | 100% |
| `french` | string | French translation | 100% |
| `english` | string | English translation | 100% |
| `category` | string | Semantic/grammatical category | 100% |
| `difficulty` | string | Learning difficulty level | 100% |
| `pronunciation` | string | Phonetic pronunciation guide | 100% |
| `example_sango` | string | Example sentence in Sango | <1% |
| `example_french` | string | Example sentence in French | <1% |
| `example_english` | string | Example sentence in English | <1% |
**Example entry:**
```json
{
  "sango": "Bara âla",
  "french": "Bonjour",
  "english": "Hello",
  "category": "greetings",
  "difficulty": "beginner",
  "pronunciation": "BAH-rah AH-lah",
  "example_sango": "Bara âla, tongana nye?",
  "example_french": "Bonjour, comment allez-vous?",
  "example_english": "Hello, how are you?"
}
```
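As a minimal sketch of how entries in this shape could be read and sanity-checked, the snippet below parses JSONL lines with the standard library and rejects any record that lacks one of the six fields listed at 100% coverage. The helper name `parse_vocabulary` is hypothetical, and the snippet assumes the example fields may be absent or null when an entry has no example sentences.

```python
import json

# Field names taken from the table above; the three example_* fields are
# treated as optional because fewer than 1% of entries carry them.
REQUIRED_FIELDS = (
    "sango", "french", "english", "category", "difficulty", "pronunciation",
)


def parse_vocabulary(lines):
    """Parse JSONL lines into dicts, skipping blank lines.

    Raises ValueError if a record is missing (or has an empty value for)
    any required field. `parse_vocabulary` is an illustrative helper,
    not part of the dataset's tooling.
    """
    entries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if not entry.get(name)]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        entries.append(entry)
    return entries


# Inline sample based on the example entry above, without example sentences.
sample = [
    '{"sango": "Bara âla", "french": "Bonjour", "english": "Hello", '
    '"category": "greetings", "difficulty": "beginner", '
    '"pronunciation": "BAH-rah AH-lah"}',
]
entries = parse_vocabulary(sample)
print(entries[0]["sango"], "->", entries[0]["english"])  # Bara âla -> Hello
```

With a local copy of the file, the same function should accept an open file handle directly, for example `parse_vocabulary(open("vocabulary.jsonl", encoding="utf-8"))`.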
### Training Pairs (`training_pairs.jsonl`)
360 Sango-French translation pairs extracted from grammar and dictionary reference materials.
| Field | Type | Description |
| ------------- | ------ | ----------------------------------- |
| `source_lang` | string | Source language code (`sg` or `fr`) |
| `target_lang` | string | Target language code (`sg` or `fr`) |
| `source` | string | Source text |
| `target` | string | Target text |
**Note on quality**: These pairs were extracted from a Sango grammar and dictionary reference work. Many entries are genuine Sango-French translations or vocabulary definitions, but some are sentence fragments or contextual excerpts from the source material. Filter the pairs before using them in downstream tasks that require clean parallel sentences.
**Example entry:**
```json
{
  "source_lang": "sg",
  "target_lang": "fr",
  "source": "Mbï yê Bêafrîka mîngi",
  "target": "j'aime beaucoup la Centrafrique"
}
```
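The quality note above recommends filtering but does not prescribe a method. One possible starting point is a rule-based filter like the sketch below. The function name `is_clean_pair`, the thresholds, and the trailing-punctuation rule are all assumptions chosen for illustration; they are not derived from the dataset and should be tuned against the actual file. The second sample record is a made-up truncated excerpt.

```python
def is_clean_pair(pair, min_chars=3, max_length_ratio=4.0):
    """Heuristic check for a usable translation pair.

    All thresholds are illustrative assumptions, not values recommended
    by the dataset authors.
    """
    # Source and target must be in different languages.
    if pair.get("source_lang") == pair.get("target_lang"):
        return False
    source = (pair.get("source") or "").strip()
    target = (pair.get("target") or "").strip()
    # Drop empty or very short strings.
    if len(source) < min_chars or len(target) < min_chars:
        return False
    # Drop pairs whose character lengths differ wildly.
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if longer / shorter > max_length_ratio:
        return False
    # Drop text that stops mid-clause (trailing comma, colon, hyphen, ...).
    if source[-1] in ",;:-" or target[-1] in ",;:-":
        return False
    return True


pairs = [
    # The example entry shown above.
    {"source_lang": "sg", "target_lang": "fr",
     "source": "Mbï yê Bêafrîka mîngi",
     "target": "j'aime beaucoup la Centrafrique"},
    # Hypothetical truncated excerpt, invented for this sketch.
    {"source_lang": "sg", "target_lang": "fr",
     "source": "Mbï yê,", "target": "j'aime,"},
]
clean = [pair for pair in pairs if is_clean_pair(pair)]
print(len(clean))  # 1
```

Rules like these only catch surface-level fragments; a manual review of a sample is still advisable before fine-tuning.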
### Category Distribution
The vocabulary spans 25 semantic categories:
| Category | Count | Category | Count |
| --------- | ----- | ----------- | ----- |
| nature | 30 | professions | 30 |
| clothing | 25 | commerce | 25 |
| education | 25 | health | 25 |
| home | 25 | adjectives | 20 |
| numbers | 20 | technology | 20 |
| transport | 20 | verbs | 18 |
| animals | 15 | emotions | 15 |
| greetings | 15 | body | 10 |
| colors | 10 | essential | 10 |
| actions | 10 | food | 12 |
| family | 12 | places | 12 |
| time | 10 | weather | 10 |
| questions | 7 | | |
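The table can be cross-checked against a local copy of `vocabulary.jsonl`. The sketch below copies the published counts into a dictionary, confirms that they cover 25 categories and sum to 431, and provides a hypothetical helper, `category_mismatches`, that reports any category whose tally in the loaded entries differs from the table. It assumes the `category` values in the file use the same lowercase labels as the table.

```python
from collections import Counter

# Counts copied from the table above.
PUBLISHED_COUNTS = {
    "nature": 30, "professions": 30, "clothing": 25, "commerce": 25,
    "education": 25, "health": 25, "home": 25, "adjectives": 20,
    "numbers": 20, "technology": 20, "transport": 20, "verbs": 18,
    "animals": 15, "emotions": 15, "greetings": 15, "body": 10,
    "colors": 10, "essential": 10, "actions": 10, "food": 12,
    "family": 12, "places": 12, "time": 10, "weather": 10,
    "questions": 7,
}


def category_mismatches(entries, expected=PUBLISHED_COUNTS):
    """Return {category: (expected, actual)} for every tally that differs."""
    actual = Counter(entry["category"] for entry in entries)
    return {
        name: (expected.get(name, 0), actual.get(name, 0))
        for name in sorted(set(expected) | set(actual))
        if expected.get(name, 0) != actual.get(name, 0)
    }


print(len(PUBLISHED_COUNTS), sum(PUBLISHED_COUNTS.values()))  # 25 431

# Toy check: seven "questions" entries, compared against a two-row table.
toy_entries = [{"category": "questions"}] * 7
mismatches = category_mismatches(toy_entries, {"questions": 7, "time": 10})
print(mismatches)  # {'time': (10, 0)}
```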
### Difficulty Distribution
| Level | Count | Percentage |
| ------------ | ----- | ---------- |
| Beginner | 210 | 48.7% |
| Intermediate | 200 | 46.4% |
| Advanced | 21 | 4.9% |
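The percentages follow directly from the counts. As a quick arithmetic check, the sketch below recomputes each share from the counts in the table. The lowercase keys are an assumption based on the `"difficulty": "beginner"` value in the example entry.

```python
# Counts copied from the table above; lowercase keys mirror the example entry.
difficulty_counts = {"beginner": 210, "intermediate": 200, "advanced": 21}

total = sum(difficulty_counts.values())
shares = {
    level: round(100 * count / total, 1)
    for level, count in difficulty_counts.items()
}
print(total)   # 431
print(shares)  # {'beginner': 48.7, 'intermediate': 46.4, 'advanced': 4.9}
```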
## Dataset Creation
### Source Data
- **Vocabulary entries**: Curated by MEYNG from Sango language learning resources including japprendslesango.com and verified Sango vocabulary references
- **Training pairs**: Extracted from a comprehensive Sango grammar and dictionary reference work
- **Verification**: Entries reviewed against known Sango linguistic sources
### Annotation Process
Vocabulary entries were structured and categorized by native speakers and linguists familiar with Sango. Pronunciation guides use an English-approximation phonetic system designed for learners.
## Considerations for Using the Data
### Known Limitations
1. **Dataset size**: At 431 vocabulary entries and 360 translation pairs, this dataset is small by modern NLP standards. It is best suited as a seed dataset, evaluation set, or for few-shot learning scenarios.
2. **Training pair quality**: The translation pairs contain some fragmented text from the source material. Filtering is recommended for tasks requiring clean parallel data.
3. **Example sentences**: Only 1 of the 431 vocabulary entries (under 1%) includes example sentences. Future versions aim to add examples for all entries.
4. **Dialect coverage**: Sango has regional variations. This dataset primarily represents the standardized Sango used in Bangui (the capital).
5. **Pronunciation guides**: The phonetic guides approximate Sango sounds using English phonetics. They are intended for learners, not for phonological research.
### Recommendations
- Use as a seed dataset to bootstrap Sango NLP systems
- Combine with the SangoAI API (643+ words available at `https://sangoai.sbs`) for a larger vocabulary
- Apply quality filtering on `training_pairs.jsonl` before fine-tuning
- Consider data augmentation techniques for downstream tasks
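One way to act on the seed-dataset recommendation is to turn each vocabulary entry into word-level translation records that share the `training_pairs.jsonl` field layout. The sketch below is an assumption-laden illustration, not part of the dataset's tooling: `vocabulary_to_pairs` is a hypothetical helper, and the `en` target code goes beyond the `sg`/`fr` codes documented for the training pairs.

```python
def vocabulary_to_pairs(entry):
    """Build sg->fr and sg->en records from one vocabulary entry.

    Output keys mirror the training_pairs.jsonl layout. The "en" target
    code is an extension introduced for this sketch.
    """
    pairs = []
    for target_lang, field in (("fr", "french"), ("en", "english")):
        target = (entry.get(field) or "").strip()
        if target:  # skip missing or empty translations
            pairs.append({
                "source_lang": "sg",
                "target_lang": target_lang,
                "source": entry["sango"],
                "target": target,
            })
    return pairs


entry = {"sango": "Bara âla", "french": "Bonjour", "english": "Hello"}
for pair in vocabulary_to_pairs(entry):
    print(pair["target_lang"], pair["source"], "->", pair["target"])
# fr Bara âla -> Bonjour
# en Bara âla -> Hello
```

Word-level records like these are lexical rather than sentential, so they are probably best kept separate from sentence pairs, or tagged, when mixed into a translation training set.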
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{meyng_sango_vocabulary_2026,
  title={Sango Vocabulary Dataset: A Trilingual Lexical Resource for an Underrepresented African Language},
  author={MEYNG},
  year={2026},
  url={https://huggingface.co/datasets/meyng/sango-vocabulary},
  license={CC-BY-SA-4.0},
  language={sg, fr, en}
}
```
## License
This dataset is released under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA-4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
You are free to:
- **Share** -- copy and redistribute the material in any medium or format
- **Adapt** -- remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- **Attribution** -- You must give appropriate credit to MEYNG, provide a link to the license, and indicate if changes were made
- **ShareAlike** -- If you remix, transform, or build upon the material, you must distribute your contributions under the same license
## Acknowledgments
This dataset was created by **[MEYNG](https://meyng.com)** as part of the **[SangoAI](https://sangoai.sbs)** project, an AI-powered language platform dedicated to the preservation and digitization of Sango.
We acknowledge:
- The people of the **Central African Republic**, whose language and culture this dataset aims to preserve and promote
- The Sango-speaking community worldwide for their contributions to language documentation
- The creators of **japprendslesango.com** and other Sango learning resources that informed this work
- The open-source NLP community working on low-resource African languages
## Contact
- **Organization**: MEYNG
- **Website**: [meyng.com](https://meyng.com)
- **Platform**: [sangoai.sbs](https://sangoai.sbs)
- **Email**: contact@meyng.com
- **GitHub**: [github.com/meyng-hub/sangoai](https://github.com/meyng-hub/sangoai)
Provider: MEYNG



