CraneAILabs/luganda-fln-training-data
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
---
license: apache-2.0
language:
- lug
- en
task_categories:
- question-answering
- text-classification
- text-generation
tags:
- luganda
- education
- pedagogy
- foundational-literacy
- primary-education
- uganda
- bilingual
- mcq
- teacher-training
- low-resource
pretty_name: Luganda Foundational Literacy & Numeracy Training Data
size_categories:
- 1K<n<10K
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path:
    - "data/fln_synthetic/train.jsonl"
    - "data/fln_content/train.jsonl"
    - "data/fln_shortform/train.jsonl"
    - "data/fln_weak_sections/train.jsonl"
- config_name: fln_synthetic
  data_files:
  - split: train
    path: "data/fln_synthetic/train.jsonl"
- config_name: fln_content
  data_files:
  - split: train
    path: "data/fln_content/train.jsonl"
- config_name: fln_shortform
  data_files:
  - split: train
    path: "data/fln_shortform/train.jsonl"
- config_name: fln_weak_sections
  data_files:
  - split: train
    path: "data/fln_weak_sections/train.jsonl"
---
# Luganda FLN Training Data
Training data for foundational literacy and numeracy (FLN) models that support Ugandan primary school teachers (P1–P3). The dataset is designed to train small language models (1B parameters) to generate pedagogically sound content in Luganda and English.
## Dataset Description
This dataset contains **1,368 training examples** across four complementary splits, each published as its own config with a single `train` split. Each one targets a different aspect of teacher pedagogical content knowledge for early literacy instruction.
### Splits
| Split | Count | Format | Description |
|-------|-------|--------|-------------|
| `fln_synthetic` | 590 | MCQ (chat template) | Bilingual Luganda/English multiple-choice questions on literacy pedagogy. Covers 16 sub-domains including phonological awareness, systematic phonics, vocabulary, reading comprehension, grammar instruction, and assessment. |
| `fln_weak_sections` | 600 | MCQ (chat template) | Targeted MCQs addressing specific weak areas identified during model evaluation. Focuses on sections where the base model scored below threshold. |
| `fln_shortform` | 123 | Short answer | Short-form questions testing Luganda linguistic knowledge: syntax & grammar, vocabulary & context, phonics & orthography, morphology & concord, and phonological awareness. |
| `fln_content` | 55 | Long-form (chat template) | Pedagogical content generation examples across 11 literacy instruction sections. Used to train the model to produce structured lesson content. |
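The per-split counts in the table add up to the stated total of 1,368. The sketch below records those counts and shows one way a config might be loaded and sanity-checked. It assumes the `datasets` package is installed and the Hub is reachable; `load_subset` and `EXPECTED_COUNTS` are illustrative names, not part of the dataset.

```python
import warnings

# Row counts per config, copied from the table above.
EXPECTED_COUNTS = {
    "fln_synthetic": 590,
    "fln_weak_sections": 600,
    "fln_shortform": 123,
    "fln_content": 55,
}
TOTAL_EXAMPLES = sum(EXPECTED_COUNTS.values())  # matches the stated 1,368

REPO_ID = "CraneAILabs/luganda-fln-training-data"


def load_subset(name="all"):
    """Load one config's train split and warn if its size differs from the card.

    Hypothetical helper: requires the `datasets` package and network access.
    """
    # Imported lazily so the constants above remain usable offline.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, name, split="train")
    expected = TOTAL_EXAMPLES if name == "all" else EXPECTED_COUNTS[name]
    if len(ds) != expected:
        warnings.warn(f"{name}: card lists {expected} rows, loaded {len(ds)}")
    return ds
```

Calling `load_subset("fln_shortform")` would then return only the short-answer rows, while the default `"all"` config concatenates the four files.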
### Data Format
All files are JSONL (one JSON object per line). Each row uses the Gemma chat template format:
```json
{
  "text": "<start_of_turn>user\n[Question in Luganda]\n(A) ...\n(B) ...\n(C) ...\n(D) ...<end_of_turn>\n<start_of_turn>model\n[Answer]<end_of_turn>",
  "format": "mcq",
  "correct_letter": "B",
  "source": "fln_synthetic_phonological_awareness",
  "source_id": "fln_synthetic_phonological_awareness_042",
  "category": "Literacy"
}
```
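For inspection or evaluation it can help to split the `text` field back into its user and model turns. The following is a minimal sketch using only the standard library. It assumes each row holds a single user/model exchange delimited exactly as in the example above; the sample row uses made-up English placeholder content, and `split_turns` and `parse_row` are hypothetical helper names.

```python
import json
import re

# Matches one Gemma-style turn: <start_of_turn>ROLE\nCONTENT<end_of_turn>
TURN_RE = re.compile(r"<start_of_turn>(user|model)\n(.*?)<end_of_turn>", re.DOTALL)


def split_turns(text):
    """Return {"user": ..., "model": ...} for a single-exchange row."""
    return {role: content.strip() for role, content in TURN_RE.findall(text)}


def parse_row(line):
    """Decode one JSONL line and attach the parsed turns under "turns"."""
    row = json.loads(line)
    row["turns"] = split_turns(row["text"])
    return row


# Illustrative row with placeholder content (not taken from the dataset).
sample_line = json.dumps({
    "text": "<start_of_turn>user\nWhich activity builds phonological awareness?\n"
            "(A) Copying letters\n(B) Clapping syllables\n(C) Silent reading\n(D) Dictation"
            "<end_of_turn>\n<start_of_turn>model\nB<end_of_turn>",
    "format": "mcq",
    "correct_letter": "B",
})

row = parse_row(sample_line)
print(row["turns"]["model"])  # B
```

Rows with several exchanges would need a list of turns rather than a dictionary, since later turns would overwrite earlier ones here.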
### Domains Covered
The `fln_synthetic` and `fln_weak_sections` splits cover 16 pedagogical sub-domains:
- Phonological awareness
- Systematic phonics
- Vocabulary instruction
- Reading comprehension
- Reading fluency
- Grammar instruction
- Writing instruction
- Oral language development
- High-frequency words
- Decoding strategies
- Morphology instruction
- Instructional sequencing
- Assessment methods
- Pacing and differentiation
- Fluency development
- Comprehension strategies
## Data Generation
### Sources
1. **Luganda instructional materials**: Validated P1–P3 textbooks and grammar guides, extracted via a Docling OCR pipeline from 3,425 source PDFs.
2. **Teacher assessment data**: Digitized question papers and assessments from Ugandan primary teachers, providing real-world examination structures.
3. **Fab Inc pedagogical templates**: High-quality English/multilingual examples providing proven structural frameworks, adapted for the Luganda context.
### Generation Pipeline
Content was generated using Google Gemini 2.5 Flash (Vertex AI Batch API) with Luganda context injection from curated source materials. A 6-layer validation process ensured bilingual completeness, cultural authenticity, and pedagogical accuracy. See the accompanying pipeline specification document for full methodology.
### Quality Assurance
- All MCQ items validated for correct answer alignment
- Bilingual content verified for Luganda/English consistency
- Cultural context reviewed for Ugandan primary school appropriateness
- Data contamination analysis performed (clean — 0% overlap with evaluation set)
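The card does not describe how these checks were implemented. As a rough illustration only, the sketch below shows one plausible form for two of them: an answer-alignment check for MCQ rows and an exact-match overlap rate against an evaluation set. All function names are hypothetical, and the assumptions that options appear as `(A)` to `(D)` at line starts and that the model turn begins with the keyed letter come from the format example, not from the authors' pipeline.

```python
import re

# Option labels such as "(A)" at the start of a line in the user turn.
OPTION_RE = re.compile(r"^\(([A-D])\)", re.MULTILINE)


def mcq_is_aligned(user_turn, model_turn, correct_letter):
    """True if options (A)-(D) each appear once and the answer starts with the keyed letter."""
    letters = OPTION_RE.findall(user_turn)
    if sorted(letters) != ["A", "B", "C", "D"]:
        return False
    return model_turn.strip().lstrip("(").startswith(correct_letter)


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide duplicates."""
    return " ".join(text.lower().split())


def overlap_rate(train_texts, eval_texts):
    """Fraction of training items whose normalized text also appears in the evaluation set."""
    train_texts = list(train_texts)
    if not train_texts:
        return 0.0
    eval_set = {normalize(t) for t in eval_texts}
    hits = sum(normalize(t) in eval_set for t in train_texts)
    return hits / len(train_texts)
```

Exact matching after normalization only catches verbatim duplicates; paraphrased or translated overlap would need a fuzzier comparison such as n-gram matching.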
## Intended Use
- Fine-tuning small language models (1–3B parameters) for Luganda educational content generation
- Training bilingual pedagogical assistants for Ugandan primary school teachers
- Research on low-resource language educational AI
## Limitations
- Limited to P1–P3 foundational literacy (does not cover numeracy, science, or upper primary)
- Luganda translations were AI-generated with human review — some phrasing may not reflect regional dialect preferences
- MCQ format may not capture all pedagogical reasoning patterns
- Training data was generated from a limited corpus of available Luganda educational materials
## Citation
```bibtex
@misc{craneailabs2026fln,
title={Luganda Foundational Literacy Training Data for On-Device Teacher Companion Models},
author={Bakunga, Bronson and Mubiru, Kato Steven and Tukamushaba, Catherine},
year={2026},
publisher={Crane AI Labs},
url={https://huggingface.co/datasets/CraneAILabs/luganda-fln-training-data}
}
```
## Acknowledgments
This work was supported by Fab Inc, funded by the Bill & Melinda Gates Foundation. Field research and Luganda linguistic validation were conducted by Crane AI Labs.