CraneAILabs/luganda-fln-training-data

Name: CraneAILabs/luganda-fln-training-data
Creator: CraneAILabs
Published: 2026-04-08 16:23:14
License: apache-2.0 (per the dataset card metadata)

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/CraneAILabs/luganda-fln-training-data


Description:

---
license: apache-2.0
language:
- lug
- en
task_categories:
- question-answering
- text-classification
- text-generation
tags:
- luganda
- education
- pedagogy
- foundational-literacy
- primary-education
- uganda
- bilingual
- mcq
- teacher-training
- low-resource
pretty_name: Luganda Foundational Literacy & Numeracy Training Data
size_categories:
- 1K<n<10K
configs:
- config_name: all
  default: true
  data_files:
  - split: train
    path:
    - "data/fln_synthetic/train.jsonl"
    - "data/fln_content/train.jsonl"
    - "data/fln_shortform/train.jsonl"
    - "data/fln_weak_sections/train.jsonl"
- config_name: fln_synthetic
  data_files:
  - split: train
    path: "data/fln_synthetic/train.jsonl"
- config_name: fln_content
  data_files:
  - split: train
    path: "data/fln_content/train.jsonl"
- config_name: fln_shortform
  data_files:
  - split: train
    path: "data/fln_shortform/train.jsonl"
- config_name: fln_weak_sections
  data_files:
  - split: train
    path: "data/fln_weak_sections/train.jsonl"
---

# Luganda FLN Training Data

Training data for foundational literacy and numeracy (FLN) models targeting Ugandan primary school teachers (P1–P3). Designed to train small language models (1B parameters) to generate pedagogically sound content in Luganda and English.

## Dataset Description

This dataset contains **1,368 training examples** across four complementary splits, each targeting different aspects of teacher pedagogical content knowledge for early literacy instruction.

### Splits

| Split | Count | Format | Description |
|-------|-------|--------|-------------|
| `fln_synthetic` | 590 | MCQ (chat template) | Bilingual Luganda/English multiple-choice questions on literacy pedagogy. Covers 16 sub-domains including phonological awareness, systematic phonics, vocabulary, reading comprehension, grammar instruction, and assessment. |
| `fln_weak_sections` | 600 | MCQ (chat template) | Targeted MCQs addressing specific weak areas identified during model evaluation. Focuses on sections where the base model scored below threshold. |
| `fln_shortform` | 123 | Short answer | Short-form questions testing Luganda linguistic knowledge: syntax & grammar, vocabulary & context, phonics & orthography, morphology & concord, and phonological awareness. |
| `fln_content` | 55 | Long-form (chat template) | Pedagogical content generation examples across 11 literacy instruction sections. Used to train the model to produce structured lesson content. |

### Data Format

All files are JSONL (one JSON object per line). Each row uses the Gemma chat template format:

```json
{
  "text": "<start_of_turn>user\n[Question in Luganda]\n(A) ...\n(B) ...\n(C) ...\n(D) ...<end_of_turn>\n<start_of_turn>model\n[Answer]<end_of_turn>",
  "format": "mcq",
  "correct_letter": "B",
  "source": "fln_synthetic_phonological_awareness",
  "source_id": "fln_synthetic_phonological_awareness_042",
  "category": "Literacy"
}
```

### Domains Covered

The synthetic and weak sections splits cover 16 pedagogical sub-domains:

- Phonological awareness
- Systematic phonics
- Vocabulary instruction
- Reading comprehension
- Reading fluency
- Grammar instruction
- Writing instruction
- Oral language development
- High-frequency words
- Decoding strategies
- Morphology instruction
- Instructional sequencing
- Assessment methods
- Pacing and differentiation
- Fluency development
- Comprehension strategies

## Data Generation

### Sources

1. **Luganda instructional materials**: Validated P1–P3 textbooks and grammar guides, extracted via Docling OCR pipeline from 3,425 source PDFs.
2. **Teacher assessment data**: Digitized question papers and assessments from Ugandan primary teachers providing real-world examination structures.
3. **Fab Inc pedagogical templates**: High-quality English/multilingual examples providing proven structural frameworks, adapted for Luganda context.
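For readers who want to work with the row layout described under Data Format, the following is a minimal sketch of splitting the `text` field back into its user and model turns. It uses only the Python standard library. The helper name `parse_row` and the sample row are illustrative assumptions, not part of the dataset tooling, and the sketch assumes each row holds a single user turn followed by a single model turn.

```python
import json
import re

# One turn in the Gemma chat template: the role, a newline, then the turn body.
TURN_RE = re.compile(r"<start_of_turn>(user|model)\n(.*?)<end_of_turn>", re.DOTALL)


def parse_row(line: str) -> dict:
    """Parse one JSONL line and split its `text` field into prompt and response.

    Hypothetical helper: assumes one user turn and one model turn per row.
    """
    row = json.loads(line)
    turns = {role: body.strip() for role, body in TURN_RE.findall(row["text"])}
    return {
        "prompt": turns.get("user", ""),
        "response": turns.get("model", ""),
        "format": row.get("format"),
        "correct_letter": row.get("correct_letter"),
        "source_id": row.get("source_id"),
    }


# Placeholder row shaped like the example in the dataset card (not real data).
sample_line = json.dumps({
    "text": (
        "<start_of_turn>user\n[Question in Luganda]\n(A) ...\n(B) ...\n(C) ...\n(D) ..."
        "<end_of_turn>\n<start_of_turn>model\n[Answer]<end_of_turn>"
    ),
    "format": "mcq",
    "correct_letter": "B",
    "source": "fln_synthetic_phonological_awareness",
    "source_id": "fln_synthetic_phonological_awareness_042",
    "category": "Literacy",
})

parsed = parse_row(sample_line)
print(parsed["prompt"])
print(parsed["response"])
```

To process a whole file, the same function could be applied to each line of, for example, `data/fln_synthetic/train.jsonl`. Rows in the short-answer split may not carry a `correct_letter`, which is why the sketch reads optional fields with `.get`.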
### Generation Pipeline

Content was generated using Google Gemini 2.5 Flash (Vertex AI Batch API) with Luganda context injection from curated source materials. A 6-layer validation process ensured bilingual completeness, cultural authenticity, and pedagogical accuracy. See the accompanying pipeline specification document for full methodology.

### Quality Assurance

- All MCQ items validated for correct answer alignment
- Bilingual content verified for Luganda/English consistency
- Cultural context reviewed for Ugandan primary school appropriateness
- Data contamination analysis performed (clean: 0% overlap with the evaluation set)

## Intended Use

- Fine-tuning small language models (1–3B parameters) for Luganda educational content generation
- Training bilingual pedagogical assistants for Ugandan primary school teachers
- Research on low-resource language educational AI

## Limitations

- Limited to P1–P3 foundational literacy (does not cover numeracy, science, or upper primary)
- Luganda translations were AI-generated with human review; some phrasing may not reflect regional dialect preferences
- MCQ format may not capture all pedagogical reasoning patterns
- Training data was generated from a limited corpus of available Luganda educational materials

## Citation

```bibtex
@misc{craneailabs2026fln,
  title={Luganda Foundational Literacy Training Data for On-Device Teacher Companion Models},
  author={Bakunga, Bronson and Mubiru, Kato Steven and Tukamushaba, Catherine},
  year={2026},
  publisher={Crane AI Labs},
  url={https://huggingface.co/datasets/CraneAILabs/luganda-fln-training-data}
}
```

## Acknowledgments

This work was supported by Fab Inc, funded by the Bill & Melinda Gates Foundation. Field research and Luganda linguistic validation conducted by Crane AI Labs.
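Returning to the contamination check mentioned under Quality Assurance: the card does not describe how overlap was measured, so the following is only a sketch of one simple approach, exact matching after text normalization. The function names and the toy English strings are hypothetical and do not come from the dataset or its pipeline.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def overlap_rate(train_texts, eval_texts) -> float:
    """Fraction of evaluation items whose normalized text also occurs in training."""
    train_set = {normalize(t) for t in train_texts}
    eval_norm = [normalize(t) for t in eval_texts]
    if not eval_norm:
        return 0.0
    hits = sum(1 for t in eval_norm if t in train_set)
    return hits / len(eval_norm)


# Hypothetical toy data, for illustration only.
train = ["What is phonological awareness?", "Name one decoding strategy."]
evaluation = ["Define reading fluency.", "what is  phonological awareness"]

rate = overlap_rate(train, evaluation)
print(f"{rate:.0%} of evaluation items overlap with training data")
```

Exact matching only catches verbatim or near-verbatim duplicates. Paraphrased or translated items would need a fuzzier comparison, such as n-gram overlap, which this sketch does not attempt.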

