CraneAILabs/luganda-bilingual-literacy-exercises
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/CraneAILabs/luganda-bilingual-literacy-exercises
---
license: apache-2.0
language:
- lug
- en
task_categories:
- question-answering
- text-generation
- translation
tags:
- luganda
- education
- bilingual
- literacy
- primary-education
- uganda
- phonics
- reading-comprehension
- grammar
- vocabulary
- pedagogy
- low-resource
- african-languages
pretty_name: Luganda-English Bilingual Literacy Exercises (P1-P3)
size_categories:
- 1K<n<10K
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path:
    - "data/p1/train.jsonl"
    - "data/p2/train.jsonl"
    - "data/p3/train.jsonl"
- config_name: p1
  data_files:
  - split: train
    path: "data/p1/train.jsonl"
- config_name: p2
  data_files:
  - split: train
    path: "data/p2/train.jsonl"
- config_name: p3
  data_files:
  - split: train
    path: "data/p3/train.jsonl"
---
# Luganda-English Bilingual Literacy Exercises (P1–P3)
**3,472 structured bilingual exercises** for Ugandan primary school literacy instruction (Primary 1 through Primary 3). Each exercise contains parallel English and Luganda versions with questions, answers, and explanations.
## Dataset Description
| Grade | Exercises | File |
|-------|-----------|------|
| P1 | 1,157 | `data/p1_exercises.json` |
| P2 | 1,135 | `data/p2_exercises.json` |
| P3 | 1,180 | `data/p3_exercises.json` |
| **Total** | **3,472** | |
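The per-grade counts can be checked against a local copy of the data. The following is a minimal sketch, assuming the files are laid out as in the `configs` paths of the front matter (`data/p1/train.jsonl` and so on) with one JSON object per line; if your copy uses the file names listed in the table, adjust the path accordingly. The helper names `read_jsonl` and `count_by_grade` are illustrative and not part of the dataset.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one exercise dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_grade(root, grades=("p1", "p2", "p3")):
    """Count exercises per `grade` value across <root>/data/<grade>/train.jsonl."""
    counts = Counter()
    for grade in grades:
        path = Path(root) / "data" / grade / "train.jsonl"
        if path.exists():  # skip grades that are not present locally
            counts.update(ex["grade"] for ex in read_jsonl(path))
    return counts
```

Calling `count_by_grade(".")` from the root of a complete local copy returns a `Counter` keyed by the `grade` field, which can then be compared with the table.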
### Exercise Types
Exercises span four literacy domains:
| Domain | Description | Example |
|--------|-------------|---------|
| **Phonological** | Syllable counting, rhyming, sound manipulation | "How many syllables does 'boda-boda' have?" |
| **Comprehension** | Reading passages with questions | Short Luganda passages with recall questions |
| **Vocabulary** | Word meaning, context clues, word categories | Matching words to definitions in both languages |
| **Grammar** | Noun classes, verb conjugation, sentence structure | Luganda noun class prefix identification |
### Data Format
Each exercise is a structured JSON object:
```json
{
  "exercise_id": "PHON_P1_0001",
  "grade": "P1",
  "type": "syllable_counting",
  "difficulty": "easy",
  "is_pseudo_word": false,
  "word_tested": "boda-boda",
  "cultural_context": "Common Ugandan motorcycle taxi",
  "english": {
    "question": "How many syllables does 'boda-boda' have?",
    "answer": "4",
    "explanation": "Boda-boda has 4 syllables: bo-da-bo-da."
  },
  "luganda": {
    "question": "Boda-boda erina ensuku mmeka?",
    "answer": "4",
    "explanation": "Boda-boda erina ensuku nnya: bo-da-bo-da."
  },
  "metadata": {
    "source_document": "curated_phonics_guide",
    "generation_batch": 12,
    "context_chunk": 2,
    "validation_passed": true
  }
}
```
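For question-answering use it can be convenient to split each bilingual record into one flat row per language. The sketch below shows one way to do this, using an abbreviated copy of the sample record; the function name `to_language_rows` and the output row layout are assumptions made for illustration, not a format defined by the dataset.

```python
def to_language_rows(exercise):
    """Split one bilingual exercise into one flat row per language."""
    rows = []
    for lang in ("english", "luganda"):
        side = exercise[lang]
        rows.append({
            "exercise_id": exercise["exercise_id"],
            "grade": exercise["grade"],
            "language": lang,
            "question": side["question"],
            "answer": side["answer"],
            "explanation": side["explanation"],
        })
    return rows


# Abbreviated copy of the sample record (only the fields used here).
sample = {
    "exercise_id": "PHON_P1_0001",
    "grade": "P1",
    "english": {
        "question": "How many syllables does 'boda-boda' have?",
        "answer": "4",
        "explanation": "Boda-boda has 4 syllables: bo-da-bo-da.",
    },
    "luganda": {
        "question": "Boda-boda erina ensuku mmeka?",
        "answer": "4",
        "explanation": "Boda-boda erina ensuku nnya: bo-da-bo-da.",
    },
}

rows = to_language_rows(sample)  # two rows sharing the same exercise_id
```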
### Key Fields
| Field | Type | Description |
|-------|------|-------------|
| `exercise_id` | string | Unique ID: `{DOMAIN}_{GRADE}_{NUMBER}` (e.g., `PHON_P1_0001`) |
| `grade` | string | Target grade level: `P1`, `P2`, or `P3` |
| `type` | string | Exercise type (e.g., `syllable_counting`, `fill_in_blank`, `multiple_choice`) |
| `difficulty` | string | `easy`, `medium`, or `hard` |
| `cultural_context` | string | Ugandan cultural context for the exercise content |
| `english` | object | English version with `question`, `answer`, `explanation` |
| `luganda` | object | Luganda version with `question`, `answer`, `explanation` |
| `metadata` | object | Generation metadata: source document, batch, chunk, validation status |
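The documented ID format and value sets lend themselves to a quick sanity check. The sketch below is one possible check, not a validator shipped with the dataset. It assumes that the domain code consists of uppercase letters (only `PHON` appears in this card), that the number is any run of digits, and that the grade embedded in the ID should equal the `grade` field.

```python
import re

# Pattern inferred from the documented format {DOMAIN}_{GRADE}_{NUMBER}.
# Assumption: DOMAIN is uppercase letters; NUMBER is one or more digits.
ID_PATTERN = re.compile(r"^(?P<domain>[A-Z]+)_(?P<grade>P[1-3])_(?P<number>\d+)$")
DIFFICULTIES = {"easy", "medium", "hard"}


def parse_exercise_id(exercise_id):
    """Return (domain, grade, number) or raise ValueError on a malformed ID."""
    match = ID_PATTERN.match(exercise_id)
    if match is None:
        raise ValueError(f"unexpected exercise_id format: {exercise_id!r}")
    return match["domain"], match["grade"], int(match["number"])


def check_key_fields(exercise):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    try:
        _, id_grade, _ = parse_exercise_id(exercise["exercise_id"])
    except ValueError as error:
        problems.append(str(error))
    else:
        if id_grade != exercise["grade"]:
            problems.append("grade field does not match grade in exercise_id")
    if exercise["difficulty"] not in DIFFICULTIES:
        problems.append(f"unexpected difficulty: {exercise['difficulty']!r}")
    return problems
```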
## Generation Pipeline
### 8-Phase Pipeline
1. **Source Documents** — 3,425 Luganda/English educational PDFs (5.4 GB)
2. **Docling Extraction** — OCR + structure parsing → 97,286 JSON files
3. **Document Curation** — Scored by Luganda content density → top 15 documents selected
4. **Full Context Extraction** — 86.3% Luganda content retention (vs 3.1% with minimal context)
5. **Context Chunking** — 5 chunks × 4 literacy domains = 20 context blocks (~12,500 tokens each)
6. **AI Generation** — Google Gemini 2.5 Flash (Vertex AI Batch API), 136 batches, context rotation
7. **6-Layer Validation** — Bilingual completeness, content accuracy, cultural appropriateness
8. **Deduplication** — Removed 523 duplicates (33% rate in P1; lower in P2/P3)
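The card does not say how duplicates were detected in phase 8. As one plausible approach, the sketch below treats two exercises as duplicates when their English questions are identical after case-folding and whitespace normalization, and keeps the first occurrence. The choice of key and the names `normalize` and `dedupe` are assumptions, not the pipeline's actual method.

```python
def normalize(text):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.casefold().split())


def dedupe(exercises):
    """Keep the first exercise for each normalized English question."""
    seen = set()
    unique = []
    for exercise in exercises:
        key = normalize(exercise["english"]["question"])
        if key not in seen:
            seen.add(key)
            unique.append(exercise)
    return unique
```

A stricter key (for example, question plus answer) or a fuzzy similarity measure would catch different sets of duplicates; exact matching is used here only because it is the simplest to reason about.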
### Context Injection
With each batch, the generation model received authentic Luganda educational content, including:
- Luganda grammar patterns (noun class prefixes: ba-, ki-, mu-)
- Cultural vocabulary (matoke, posho, boda-boda)
- Example sentences from validated textbooks
This context rotation across 5 chunks ensured diversity while maintaining linguistic authenticity.
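How chunks were assigned to batches is not specified beyond "context rotation". One simple reading is round-robin assignment, where batch `b` uses chunk `b mod 5`. The sketch below assumes that rule and zero-based chunk indices; both are assumptions, and the lowercase domain labels are illustrative names for the four domains listed in this card.

```python
NUM_CHUNKS = 5  # from the card: 5 chunks per literacy domain
DOMAINS = ("phonological", "comprehension", "vocabulary", "grammar")


def context_for_batch(batch_index, domain):
    """Return the (domain, chunk_index) block for a batch under round-robin rotation."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    return domain, batch_index % NUM_CHUNKS


# Every (domain, chunk) combination: 4 domains x 5 chunks = 20 context blocks.
all_blocks = {(domain, chunk) for domain in DOMAINS for chunk in range(NUM_CHUNKS)}
```

Under this rule batch 12 maps to chunk 2, which happens to agree with the `generation_batch` and `context_chunk` values in the sample record, although a single record cannot confirm the rule.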
## Intended Use
- Training bilingual educational AI systems for Ugandan schools
- Generating literacy assessments aligned to Uganda's P1–P3 curriculum
- Research on bilingual exercise generation for low-resource languages
- Fine-tuning language models for Luganda educational content
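For translation-style fine-tuning, the parallel `english` and `luganda` objects can be turned into sentence pairs. The following is a rough sketch of that idea; the function name `translation_pairs` is hypothetical, and skipping the `answer` field by default is a design choice made here because answers are often short or numeric.

```python
def translation_pairs(exercise, fields=("question", "explanation")):
    """Return (english_text, luganda_text) pairs for the chosen parallel fields."""
    return [
        (exercise["english"][field], exercise["luganda"][field])
        for field in fields
    ]
```

Given the limitations noted in this card, pairs built this way are probably best reviewed by a Luganda speaker before being treated as reference translations.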
## Limitations
- Luganda content was AI-generated with context injected from authentic materials, so some exercises may contain unnatural phrasing
- Exercises are limited to four literacy domains; numeracy and other subjects are not covered
- Cultural context is Ugandan-specific and may not transfer to other Luganda-speaking regions
- Pseudo-word exercises (for decoding assessment) are phonetically plausible but have not been validated by native-speaker linguists
- P2 and P3 exercises were generated in later batches and may have slightly different quality characteristics than P1
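Users who want to set aside the pseudo-word exercises can separate them with the `is_pseudo_word` flag shown in the sample record. This is a minimal sketch; the function name `split_pseudo_words` is illustrative, and treating a missing flag as `false` is an assumption.

```python
def split_pseudo_words(exercises):
    """Partition exercises into (real_word, pseudo_word) lists."""
    real, pseudo = [], []
    for exercise in exercises:
        # Assumption: a record without the flag is not a pseudo-word exercise.
        if exercise.get("is_pseudo_word", False):
            pseudo.append(exercise)
        else:
            real.append(exercise)
    return real, pseudo
```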
## Citation
```bibtex
@misc{craneailabs2026bilingual,
title={Luganda-English Bilingual Literacy Exercises for Uganda's P1-P3 Curriculum},
author={Bakunga, Bronson and Mubiru, Kato Steven and Tukamushaba, Catherine},
year={2026},
publisher={Crane AI Labs},
url={https://huggingface.co/datasets/CraneAILabs/luganda-bilingual-literacy-exercises}
}
```
## Acknowledgments
Supported by Fab Inc and funded by the Bill & Melinda Gates Foundation. Luganda textbook content was extracted via a Docling OCR pipeline. Field research and validation were conducted by Crane AI Labs.
Provider: CraneAILabs



