AndyOnyango/KenCorpus_text
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AndyOnyango/KenCorpus_text
---
language:
- sw
- luo
- bxk
- lri
- rag
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
tags:
- swahili
- kiswahili
- dholuo
- lubukusu
- lumarachi
- lulogooli
- kenyan-languages
- low-resource-languages
- african-languages
- text-corpus
pretty_name: KenCorpus Text
size_categories:
- 1K<n<10K
---
# KenCorpus Text: A Kenyan Multilingual Text Corpus
## Dataset Description
**KenCorpus Text** is a multilingual text corpus for Kenyan languages. The texts were collected from language communities and include indigenous stories, student compositions, and material from native-language media stations and publishers. The corpus goes beyond conventional religious texts to represent everyday language use.
Three languages were selected: **Kiswahili**, **Luhya** (dialects: Lumarachi, Logooli, Lubukusu), and **Dholuo**.
## Dataset Statistics
| Language | Files | Source |
|----------|-------|--------|
| Swahili | 2,585 | Community texts |
| Swahili (Tweets) | 324 | Social media |
| Dholuo | 166 | Community texts |
| Lumarachi | 449 | Community texts |
| Lubukusu | 16 | Community texts |
| Logooli | 247 | Community texts |
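As a quick sanity check, the file counts in the table above can be totalled in a few lines of Python. This is a minimal sketch: the `FILE_COUNTS` name and dict layout are illustrative, and the numbers are copied from the table rather than read from the dataset itself.

```python
# File counts copied from the statistics table (not read from the dataset).
FILE_COUNTS = {
    "Swahili": 2585,
    "Swahili (Tweets)": 324,
    "Dholuo": 166,
    "Lumarachi": 449,
    "Lubukusu": 16,
    "Logooli": 247,
}

total = sum(FILE_COUNTS.values())

# Share of each subset, ordered from largest to smallest.
shares = {
    name: count / total
    for name, count in sorted(FILE_COUNTS.items(), key=lambda kv: kv[1], reverse=True)
}

print(f"Total files: {total}")
for name, share in shares.items():
    print(f"{name:<18} {share:6.1%}")
```

The total comes to 3,787 files, which sits within the `1K<n<10K` size category declared in the metadata, assuming roughly one record per file.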
## Data Fields
| Column | Type | Description |
|--------|------|-------------|
| text | string | Full text content |
| language | string | Language (Swahili, Dholuo, Lubukusu, Lumarachi, Logooli, Luhya_Bukusu, etc.) |
| genre | string | Genre (Culture, News, Creative_writing, Exam, Agriculture, etc.) |
| source | string | Data source (Community, Educational_Institution, Media, Publisher, etc.) |
| title | string | Title of the text if available |
| story_id | string | Unique identifier matching the KenCorpus metadata |
| source_type | string | Source category (community_text, tweets) |
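The `language` column lists both `Lubukusu` and `Luhya_Bukusu`, which suggests that one dialect may appear under more than one label. A minimal sketch of normalizing such labels before grouping or filtering is shown below. `LANGUAGE_ALIASES` and `normalize_language` are hypothetical names, and the single mapping is an assumption that both labels refer to the Bukusu dialect; it should be checked against the actual data.

```python
# Known label variants mapped to one canonical label.
# Assumption: "Luhya_Bukusu" and "Lubukusu" denote the same dialect.
LANGUAGE_ALIASES = {
    "Luhya_Bukusu": "Lubukusu",
}


def normalize_language(label: str) -> str:
    """Map a raw `language` value to a canonical label.

    Labels without a known alias are returned unchanged (whitespace stripped).
    """
    label = label.strip()
    return LANGUAGE_ALIASES.get(label, label)


print(normalize_language("Luhya_Bukusu"))
print(normalize_language("Dholuo"))
```

Any further variants hidden behind the "etc." in the table would need their own entries once they have been identified in the data.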
### Example Record
```python
{
    'text': 'HISTORIA YA SHULE YA UPILI YA OLEFREMA...',
    'language': 'Swahili',
    'genre': 'Creative_writing',
    'source': 'Educational_Institution',
    'title': 'Historia ya shule',
    'story_id': '3510',
    'source_type': 'community_text'
}
```
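One way to guard downstream code against missing or mistyped columns is a small schema check based on the field table. The sketch below is illustrative only: `EXPECTED_FIELDS` and `validate_record` are hypothetical names, not part of the dataset or the `datasets` library, and it assumes every column is a plain string, as the table states.

```python
# Column names taken from the Data Fields table; all are documented as strings.
EXPECTED_FIELDS = (
    "text", "language", "genre", "source", "title", "story_id", "source_type",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(
                f"{field} is {type(record[field]).__name__}, expected str"
            )
    return problems


example = {
    "text": "HISTORIA YA SHULE YA UPILI YA OLEFREMA...",
    "language": "Swahili",
    "genre": "Creative_writing",
    "source": "Educational_Institution",
    "title": "Historia ya shule",
    "story_id": "3510",
    "source_type": "community_text",
}

print(validate_record(example))
```

Because `title` is only present "if available", a report about that field may simply reflect a text without a title rather than a real error.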
---
## Usage
```python
from datasets import load_dataset

# Load the dataset. The repo ID below is the one given in this card; this copy
# is hosted at AndyOnyango/KenCorpus_text, so adjust the ID if needed.
dataset = load_dataset("Kencorpus/KenCorpus_text")

# Access the first sample
print(dataset['train'][0])

# Filter by language
swahili = dataset['train'].filter(lambda x: x['language'] == 'Swahili')
dholuo = dataset['train'].filter(lambda x: x['language'] == 'Dholuo')
print(f"Swahili: {len(swahili)}, Dholuo: {len(dholuo)}")
```
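Beyond one-off filters, it can help to tally every language or genre in a single pass. The sketch below uses only the standard library and works on any iterable of record dicts. `count_by` is a hypothetical helper, and the `records` list is made-up stand-in data so the example runs offline; iterating over `dataset['train']` yields dicts of the same shape, so the same function should apply there.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by(records: Iterable[Mapping[str, str]], field: str) -> Counter:
    """Count how many records carry each value of `field`."""
    return Counter(record[field] for record in records)


# Made-up stand-in records so the sketch runs without downloading anything.
records = [
    {"language": "Swahili", "genre": "News"},
    {"language": "Swahili", "genre": "Culture"},
    {"language": "Dholuo", "genre": "Story"},
]

print(count_by(records, "language"))
print(count_by(records, "genre"))
```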
---
## Genres
Culture, News, Creative_writing, Exam, Agriculture, Commerce, Religion, Song, Health, Story, and more.
---
## Data Collection
Primary data was collected from the respective language communities. It includes indigenous stories, narratives from student compositions, and material from native-language media stations and publishers.
## Dataset Curators
**Kiswahili:** Rose Felynix, Khalid Kitito, Dr. Benard Okal
**Dholuo:** Jotham Ondu Ajiki, Dr. Jackline Okello, Jonathan Muga, Mercy Lavinca Oduoll
**Luhya (Logooli):** Salano Odari, Dr. Phillip Lumwamu
**Luhya (Bukusu):** Mactilda Nekesa Makana, Mulwale Martin
**Luhya (Marachi):** Yonah Weunda
---
## Citation
```bibtex
@article{wanjawa2022kencorpus,
title={Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks},
author={Wanjawa, Barack W. and Wanzare, Lilian D. and Indede, Florence and McOnyango, Owen and Ombui, Edward and Muchemi, Lawrence},
journal={arXiv preprint arXiv:2208.12081},
year={2022}
}
```
---
## Links
- **Research Paper**: https://arxiv.org/abs/2208.12081
- **Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KLCKL5
## License
**CC-BY-4.0**
## Acknowledgments
This dataset is part of the **Kencorpus** project, which provides NLP resources for low-resource Kenyan languages.
Provided by:
AndyOnyango



