labosofaith/webCbdataset
Source: Hugging Face | Updated: 2026-03-28 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/labosofaith/webCbdataset
Resource overview:
---
license: apache-2.0
---
# Dataset Card for Competency-Based Curriculum (CBC) Dataset - Kenya
This dataset card describes a curated collection of data extracted from online sources covering Kenya’s Competency-Based Curriculum (CBC). It is intended for educational research, policy analysis, and machine learning tasks focused on curriculum understanding, content classification, and educational resource generation.
## Dataset Details
### Dataset Description
This dataset contains structured and unstructured data related to Kenya's CBC framework, including subjects, competencies, values, learning outcomes, performance indicators, and thematic lesson elements. The data has been extracted from various educational websites and documents aligned with the Ministry of Education's guidelines.
- **Curated by:** Kariuki James
- **Funded by [optional]:** Not funded
- **Shared by [optional]:** Kariuki James
- **Language(s) (NLP):** English
- **License:** Apache 2.0
### Dataset Sources
- **Repository:** [To be added if hosted online]
- **Paper [optional]:** N/A
- **Demo [optional]:** N/A
## Uses
### Direct Use
The CBC dataset can be used for:
- Curriculum mapping and analysis
- Educational policy development
- Training and evaluating NLP models for educational applications
- Generating personalized feedback for learners based on curriculum standards
### Out-of-Scope Use
The dataset is not intended for:
- Commercial textbook generation without licensing
- Misrepresentation of the Kenyan curriculum
- Any use that compromises data integrity or privacy
## Dataset Structure
The dataset is structured into:
- JSONL records with fields such as `text`, `subject`, `competency`, `value`, `theme`, etc.
- Each record reflects curriculum content or thematic educational sentences linked to specific CBC outcomes.
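As a minimal sketch of how such records might be read, the snippet below parses JSONL lines into dictionaries using only the standard library. The sample records and their field values are invented for illustration; only the field names `text`, `subject`, `competency`, `value`, and `theme` come from the card.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample lines; the values are NOT taken from the dataset.
sample = [
    '{"text": "Learners work in groups to plant a kitchen garden.", '
    '"subject": "Agriculture", "competency": "Communication and Collaboration", '
    '"value": "Responsibility", "theme": "Gardening"}',
    "",
    '{"text": "Learners count objects up to 50.", '
    '"subject": "Mathematics", "competency": "Critical Thinking and Problem Solving", '
    '"value": "Unity", "theme": "Numbers"}',
]

records = list(iter_records(sample))
for rec in records:
    print(rec["subject"], "->", rec["competency"])
```

The same function accepts an open file handle, so a local copy of the data could be read with `open(path, encoding="utf-8")`, where `path` is whatever file name the dataset is saved under.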
## Dataset Creation
### Curation Rationale
The motivation is to build accessible, machine-readable educational datasets based on CBC to empower teachers, curriculum developers, and AI researchers in Kenya and globally.
### Source Data
#### Data Collection and Processing
- Extracted using Python web scraping tools (BeautifulSoup, Requests)
- Filtered using regex and keyword match against CBC terms
- Cleaned for formatting and sentence clarity
- Manually validated for relevance
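The filtering and cleaning steps could look roughly like the sketch below. It is not the author's pipeline: the term list `CBC_TERMS` is a hypothetical stand-in (the card does not publish the actual keywords), and the fetching step with Requests and BeautifulSoup is omitted so that the example runs offline.

```python
import re

# Hypothetical CBC term list, assumed for illustration only.
CBC_TERMS = ["competency", "learning outcome", "core value", "strand", "learner"]

# One case-insensitive pattern with word boundaries around each term.
# Note that plural forms such as "learners" would need their own entries.
CBC_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in CBC_TERMS) + r")\b",
    re.IGNORECASE,
)


def clean(sentence: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()


def filter_cbc(sentences):
    """Return cleaned sentences that mention at least one CBC term."""
    cleaned = (clean(s) for s in sentences)
    return [s for s in cleaned if s and CBC_PATTERN.search(s)]


# Invented stand-ins for sentences pulled out of scraped pages.
scraped = [
    "  The learner should   be able to count up to 50. ",
    "Click here to subscribe to our newsletter!",
    "Each strand lists a specific learning outcome.",
]

kept = filter_cbc(scraped)
print(kept)
```

A filter like this only narrows the candidate set; the manual relevance check listed above would still be needed afterwards.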
#### Who are the source data producers?
Original content producers include:
- Kenya Institute of Curriculum Development (KICD)
- Ministry of Education (Kenya)
- Local school curriculum content publishers
### Annotations [optional]
#### Annotation process
Content was annotated using heuristic keyword-label mapping to assign themes, values, and competencies where applicable.
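One way to picture a heuristic keyword-label mapping is the sketch below. The keyword tables are hypothetical and chosen only to make the example run; the card does not list the actual keywords or labels used.

```python
# Hypothetical keyword-to-label tables, assumed for illustration.
COMPETENCY_KEYWORDS = {
    "Communication and Collaboration": ["group", "discuss", "share"],
    "Critical Thinking and Problem Solving": ["solve", "analyse", "compare"],
    "Digital Literacy": ["computer", "digital", "internet"],
}
VALUE_KEYWORDS = {
    "Responsibility": ["care for", "responsib"],
    "Respect": ["respect", "turns"],
}


def first_match(text, table):
    """Return the first label whose keyword occurs in text, else None."""
    lowered = text.lower()
    for label, keywords in table.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def annotate(text):
    """Attach competency and value labels where a keyword matches."""
    return {
        "text": text,
        "competency": first_match(text, COMPETENCY_KEYWORDS),
        "value": first_match(text, VALUE_KEYWORDS),
    }


labelled = annotate("Learners discuss in groups how to care for farm tools.")
unlabelled = annotate("The sun rises in the east.")
print(labelled)
print(unlabelled)
```

Because matching is by plain substring and the first matching label wins, a rule set like this is sensitive to keyword order and to incidental matches, which is one reason expert validation of the labels is worthwhile.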
#### Who are the annotators?
Annotation and labeling were done by Kariuki James (teacher and data analyst).
#### Personal and Sensitive Information
The dataset does not contain personal, sensitive, or private information.
## Bias, Risks, and Limitations
- Content bias may arise from limited access to all CBC documents or regional educational variations.
- The dataset depends on public and scraped online content, which may not fully represent all CBC subjects or grade levels.
### Recommendations
- Further validation with curriculum experts is advised.
- The dataset should not be used in isolation for national-level decisions without Ministry review.
## Citation [optional]
**BibTeX:**
```bibtex
@dataset{james2025cbc,
author = {Kariuki James},
title = {Competency-Based Curriculum (CBC) Dataset - Kenya},
year = {2025},
url = {https://huggingface.co/datasets/your_dataset_path},
note = {Dataset curated from online Kenyan educational resources}
}
```
Provider:
labosofaith
Collected summary
Dataset introduction

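Construction method
webCbdataset organizes Kenya's Competency-Based Curriculum (CBC) framework into a systematic, machine-readable form. Raw content was extracted with Python web scraping tools (BeautifulSoup and Requests) from websites and documents of Kenya's Ministry of Education and curriculum development bodies, then filtered with regular expressions and keyword matching against CBC terms. After cleaning for formatting and sentence clarity, the data is stored as JSONL, with each record containing fields such as text, subject, competency, and value. Themes, values, and competencies were assigned by heuristic keyword-label mapping, and records were manually validated for relevance.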
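Features
The core feature of webCbdataset is its structured representation of Kenya's Competency-Based Curriculum, covering subjects, competencies, values, learning outcomes, and thematic lesson elements. The dataset is in English and released under the Apache 2.0 license, which allows open and flexible use. Its content originates from authoritative bodies such as the Kenya Institute of Curriculum Development (KICD) and the Ministry of Education, and it has been manually validated and annotated, which makes it valuable for educational research. However, the dataset may be limited by the coverage of publicly available resources, with potential bias from regional educational variations or incomplete coverage of subjects and grade levels, so it should be applied with care and alongside expert review.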
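Usage
The dataset suits educational research, policy analysis, and machine learning tasks, especially curriculum understanding, content classification, and educational resource generation. Users can load the JSONL files directly and use fields such as text, subject, and competency for curriculum mapping or for training NLP models, for example to support personalized learner feedback. The dataset should not be used for commercial textbook generation without licensing or to misrepresent the Kenyan curriculum, and official review is recommended before any national-level decision, so that use of the data remains accurate and ethically sound.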
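As a rough sketch of what a curriculum mapping pass over loaded records might look like, the snippet below counts how often each competency appears under each subject. The records are invented for illustration, and the function name `competency_coverage` is hypothetical; only the `subject` and `competency` field names come from the dataset card.

```python
from collections import Counter, defaultdict


def competency_coverage(records):
    """Count how often each competency appears under each subject."""
    coverage = defaultdict(Counter)
    for rec in records:
        subject = rec.get("subject")
        competency = rec.get("competency")
        if subject and competency:  # skip records missing either label
            coverage[subject][competency] += 1
    return coverage


# Invented records standing in for rows loaded from the JSONL files.
records = [
    {"subject": "Mathematics", "competency": "Critical Thinking and Problem Solving"},
    {"subject": "Mathematics", "competency": "Critical Thinking and Problem Solving"},
    {"subject": "Mathematics", "competency": "Digital Literacy"},
    {"subject": "Agriculture", "competency": None},
]

coverage = competency_coverage(records)
for subject, counts in coverage.items():
    print(subject, dict(counts))
```

Subjects with few or no labelled competencies stand out in such a table, which can help locate the coverage gaps noted in the limitations.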
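Background and challenges
Background overview
As education systems worldwide shift toward competency-based curricula, Kenya has in recent years introduced learner-centred education reforms that aim to develop core competencies and practical skills. Against this background, webCbdataset was created in 2025 by Kariuki James, a teacher and data analyst. It brings together structured and unstructured data from Kenya's CBC framework, including subject content, competency indicators, values, and learning outcomes. Its central research question is how machine learning can be used to parse curriculum standards in support of educational policy analysis, curriculum resource generation, and personalized learning feedback, giving educational technology research a data foundation to build on.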
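Current challenges
The dataset targets the problems of curriculum content understanding and automatic classification: how to extract semantic information aligned with curriculum standards from heterogeneous online educational resources, and how to build an annotation scheme suited to NLP models for curriculum mapping and resource generation. During construction, the main difficulties came from scattered and non-standardized data sources. Web scraping and heuristic annotation were needed to handle formatting noise and regional educational variation in the raw content. The dataset contains no personal or private information, but because it depends on publicly available resources, its content may not be fully representative.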
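Common scenarios
Typical use cases
In educational research, webCbdataset provides structured data for curriculum analysis and policy making. By combining subjects, competencies, values, and learning outcomes from Kenya's CBC framework, it is often used for curriculum mapping and content classification. Its machine-readable format lets researchers study how curriculum standards are implemented, or build automated tools that assist with managing educational resources, supporting the improvement of the education system.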
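Practical applications
In practice, webCbdataset can help educators and policymakers develop well-targeted curriculum resources. For example, teachers can use the links between themes and competencies in the dataset to design teaching materials that meet national standards, and educational institutions can build intelligent tutoring systems or automated assessment tools on its structured content to fit the needs of different learners. The dataset also offers an extensible point of reference for cross-country comparative education research.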
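Derived and related work
Several lines of work in educational AI have grown out of webCbdataset. For example, researchers have used its annotated data to train curriculum content classifiers that automatically identify the correspondence between learning themes and competencies. Other work has built curriculum generation systems on the dataset to help teachers quickly create lesson plans that fit the competency-based framework. These studies add to the methods of educational data mining and offer practical examples for digitizing similar curricula elsewhere.
The content above was collected and summarized by 遇见数据集.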



