ngusadeep/Swahili-Corpus-Dataset
Source: Hugging Face · Updated 2026-01-06 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ngusadeep/Swahili-Corpus-Dataset
Resource description:
---
dataset_info:
  features:
    - name: text
      dtype: string
  splits:
    - name: train
      num_examples: 1693227 # total lines in all shards
  download_size: null
  dataset_size: null
configs:
  - config_name: default
    data_files:
      - split: train
        path: Swahili_Corpus_combined.txt
license: apache-2.0
language:
- sw
pretty_name: Swahili Corpus Dataset
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- feature-extraction
tags:
- swahili
- kiswahili
- africa
- low-resource-language
- llm-pretraining
- text-corpus
dataset_source: mendeley
dataset_type: raw_text
---
# Swahili Corpus Dataset
**A large-scale Swahili text corpus for language model pretraining and NLP research.**
The **Swahili Corpus Dataset** is a large-scale collection of Swahili (Kiswahili) text designed to support Natural Language Processing (NLP) research and the development of large language models (LLMs) for a low-resource African language.
This dataset contains approximately **1.69 million Swahili text samples** aggregated from public and official sources. It is suitable for **LLM pretraining, continual pretraining, tokenizer training, embeddings, fine-tuning, and Retrieval-Augmented Generation (RAG)**.
This Hugging Face release is a **curated and unified version** derived from the original **Swahili Corpus** published on **Mendeley Data** by **Noel Masasi** and **Bernard Masua (2024)**.
## Dataset Summary
The Swahili Corpus Dataset aggregates real-world Swahili text from multiple thematic domains, including government, health, education, agriculture, law, and news. The corpus is **unannotated** and optimized for **unsupervised learning**, making it ideal for training and adapting foundation language models to Kiswahili.
The goal of this dataset is to strengthen Swahili representation in modern NLP systems and to support African language AI research.
## Loading the Dataset
You can load the dataset using 🤗 Datasets:
```python
from datasets import load_dataset
ds = load_dataset("ngusadeep/Swahili-Corpus-Dataset", split="train")
# Access first example
print(ds[0]["text"])
```
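For a corpus of this size, it can be convenient to look at a few records without downloading the whole file first. The sketch below assumes the `streaming=True` option of `load_dataset`, which yields examples lazily. The helper names `preview_texts` and `stream_preview` are hypothetical and are not part of the dataset or the library.

```python
from itertools import islice
from typing import Dict, Iterable, List


def preview_texts(examples: Iterable[Dict[str, str]], n: int = 3) -> List[str]:
    """Return the `text` field of the first `n` examples from any iterable."""
    return [example["text"] for example in islice(examples, n)]


def stream_preview(n: int = 3) -> List[str]:
    """Stream the train split and return the first `n` texts (needs network access)."""
    # Imported lazily so preview_texts stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset(
        "ngusadeep/Swahili-Corpus-Dataset", split="train", streaming=True
    )
    return preview_texts(stream, n)
```

Calling `stream_preview(5)` should return the first five text samples. `preview_texts` also works on a fully loaded dataset or a plain list of dictionaries.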
## Dataset Structure
### Data Fields
Each example contains a single field:
| Field | Type | Description |
| ------ | ------ | ----------------------- |
| `text` | string | Raw Swahili text sample |
### Example
```python
{
    "text": "Serikali ya Jamhuri ya Muungano wa Tanzania imetoa taarifa rasmi kuhusu maendeleo ya sekta ya afya..."
}
```
There are **no labels, annotations, or metadata fields**, making this dataset suitable for large-scale unsupervised training.
## Original Corpus Categories
The original Swahili Corpus was organized into the following thematic categories, which were merged during preprocessing:
* **AFYA** – Health
* **BIASHARA** – Business & Industry
* **BUNGE** – Parliamentary records
* **DINI** – Religion
* **ELIMU** – Education
* **HABARI** – News
* **KILIMO** – Agriculture
* **MITANDAO** – Social Media
* **MASHIRIKA YA KIRAIA** – Civil Society / NGOs
* **SERIKALI** – Government
* **SHERIA** – Legal documents
* **SIASA** – Politics
A combined corpus file is also provided in the original release.
## Use Cases
* **LLM Pretraining** (LLaMA, Gemma, Qwen, Mistral)
* **Continual pretraining** for Swahili adaptation
* **Tokenizer training** for Kiswahili
* **Text generation**
* **Embedding models**
* **RAG (Retrieval-Augmented Generation)**
* **Linguistic and computational language research**
* **Low-resource language AI development**
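As one illustration of the tokenizer-training use case, the sketch below feeds batches of text to a BPE trainer from the Hugging Face `tokenizers` library. The vocabulary size, the special tokens, and the helper names `batch_texts` and `train_bpe_tokenizer` are assumptions chosen for the example, not settings recommended by the dataset authors.

```python
from typing import Iterable, Iterator, List


def batch_texts(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` non-blank texts."""
    batch: List[str] = []
    for text in texts:
        if not text.strip():
            continue  # skip blank samples
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def train_bpe_tokenizer(texts: Iterable[str], vocab_size: int = 32000):
    """Train a BPE tokenizer on `texts` (illustrative settings only)."""
    # Imported lazily so batch_texts can be used without `tokenizers` installed.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, special_tokens=["[UNK]", "[PAD]"]
    )
    tokenizer.train_from_iterator(batch_texts(texts), trainer=trainer)
    return tokenizer
```

With the dataset loaded as `ds`, a call such as `train_bpe_tokenizer(row["text"] for row in ds)` would train on the full corpus. Passing a slice first is a cheaper way to try the settings.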
## Data Collection & Processing (Original Authors)
According to the original publication, the dataset was created through the following steps:
1. Identification of Swahili content categories
2. Collection of documents from public and official sources
3. Downloading files in PDF and DOCX formats
4. Text extraction using Python scripts
5. Cleaning, normalization, and merging
6. Generation of corpus statistics
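The original extraction and cleaning scripts are not reproduced here. As a rough idea of what a cleaning and normalization pass (step 5) might involve, the following sketch applies Unicode NFKC normalization, removes non-whitespace control characters, and collapses whitespace. The function names and the `min_chars` threshold are hypothetical, and this is not the authors' pipeline.

```python
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def normalize_line(line: str) -> str:
    """NFKC-normalize, drop stray control characters, and collapse whitespace."""
    line = unicodedata.normalize("NFKC", line)
    # Keep whitespace (handled below) but delete other control characters,
    # such as NUL bytes that sometimes survive PDF extraction.
    line = "".join(
        ch for ch in line if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return _WHITESPACE.sub(" ", line).strip()


def clean_lines(lines: Iterable[str], min_chars: int = 1) -> Iterator[str]:
    """Yield normalized lines that have at least `min_chars` characters."""
    for line in lines:
        cleaned = normalize_line(line)
        if len(cleaned) >= min_chars:
            yield cleaned
```

Raising `min_chars` is a simple way to drop fragments such as page numbers or isolated headings, at the cost of also discarding short but valid sentences.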
## Dataset Statistics
* **Total samples:** 1,693,227 lines (per `num_examples` in the dataset metadata)
* **Splits:** `train` only
* **Language:** Swahili
* **Annotation:** None (raw text)
* **Domain coverage:** Government, parliamentary records, news, health, education, agriculture, business, religion, law, politics, civil society, and social media
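Figures like these can be recomputed locally. The sketch below counts samples, whitespace-separated words, and characters over any iterable of strings. The `CorpusStats` class and `corpus_stats` function are hypothetical names, and whitespace splitting is only a rough approximation of word counts.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CorpusStats:
    samples: int
    words: int
    characters: int

    @property
    def mean_words_per_sample(self) -> float:
        """Average number of whitespace-separated words per sample."""
        return self.words / self.samples if self.samples else 0.0


def corpus_stats(texts: Iterable[str]) -> CorpusStats:
    """Accumulate simple size statistics in a single pass."""
    samples = words = characters = 0
    for text in texts:
        samples += 1
        words += len(text.split())
        characters += len(text)
    return CorpusStats(samples, words, characters)
```

With the dataset loaded as `ds`, `corpus_stats(row["text"] for row in ds)` processes the corpus in one pass without holding all strings in memory.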
## Limitations
* Dataset is **unannotated**
* May contain OCR artifacts or formatting noise
* Domain bias toward formal and government-related text
* Not guaranteed to be free of sensitive or outdated information
Users are encouraged to apply additional cleaning, filtering, or deduplication for production use.
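One possible starting point for deduplication is exact matching on a normalized form of each sample. The sketch below, with the hypothetical helper `dedupe_exact`, treats two samples as duplicates when they are identical after case folding and whitespace collapsing. It does not catch near-duplicates such as lightly edited copies.

```python
import hashlib
from typing import Iterable, Iterator, Set


def dedupe_exact(texts: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each text, ignoring case and spacing."""
    seen: Set[bytes] = set()
    for text in texts:
        key = " ".join(text.casefold().split())
        if not key:
            continue  # drop empty or whitespace-only samples
        # Store a fixed-size digest instead of the full string to limit memory use.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

The original text of the first occurrence is kept unchanged, so the normalization only affects how samples are compared.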
## Citation
If you use this dataset, please cite **both the Hugging Face release and the original source**.
### Hugging Face Release
```bibtex
@dataset{ngusadeep_swahili_corpus_2025,
title = {Swahili Corpus Dataset},
author = {Ngusa, Samwel},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset},
note = {Curated, processed, and released on Hugging Face}
}
```
### Original Dataset
```bibtex
@dataset{masasi2024swahili,
title = {Swahili Corpus},
author = {Masasi, Noel and Masua, Bernard},
year = {2024},
publisher = {Mendeley Data},
version = {2},
doi = {10.17632/d4yhn5b9n6.2}
}
```
## License
* **Hugging Face Version:** Apache License 2.0
* **Original Dataset:** Creative Commons Attribution 4.0 (CC BY 4.0)
## Acknowledgments
This dataset is based on the **Swahili Corpus** originally created and published by
**Noel Masasi** and **Bernard Masua (2024)** on **Mendeley Data**.
The Hugging Face release was **curated, processed, unified, and documented** by
**Samwel Ngusa (ngusadeep)** to support large-scale NLP and LLM pretraining for Kiswahili.
## Maintainer
* **Samwel Ngusa** (`ngusadeep`)
For questions, issues, or improvements, please open a discussion on the dataset page:
[https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset](https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset)
🇹🇿 **Advancing Swahili NLP and African Language AI**
Provider:
ngusadeep



