nileagi/swahili-language-exposure
Hugging Face · Updated 2026-01-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nileagi/swahili-language-exposure
Resource description:
---
license: mit
language:
- sw
task_categories:
- text-generation
- question-answering
- feature-extraction
tags:
- swahili
- language-exposure
- pretraining
- low-resource-language
- african-languages
- conversational-text
pretty_name: Swahili Language Exposure (NileAGI)
size_categories: unknown
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3077404935
    num_examples: 1589831
  - name: validation
    num_bytes: 161180380
    num_examples: 83676
  download_size: 1936721770
  dataset_size: 3238585315
---
# swahili-language-exposure
## Dataset Summary
**swahili-language-exposure** is a large-scale Swahili (Kiswahili) corpus designed for **language exposure and continued pretraining** of language models.
Unlike instruction-tuning datasets, this dataset focuses on exposing models to **natural Swahili usage** across conversations, explanations, narratives, technical discussions, and mixed-domain text. The goal is to improve **fluency, vocabulary coverage, syntax, and cultural grounding** in Swahili.
This dataset is developed and maintained by **NileAGI**.
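To give a sense of scale, the split sizes declared in the card metadata can be turned into a few summary figures. The sketch below only does arithmetic on the `num_bytes` and `num_examples` values listed under `dataset_info`; the `SPLITS` dictionary is an illustrative container, not something the dataset ships.

```python
# Split sizes copied from the card metadata (dataset_info.splits).
SPLITS = {
    "train": {"num_bytes": 3_077_404_935, "num_examples": 1_589_831},
    "validation": {"num_bytes": 161_180_380, "num_examples": 83_676},
}

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

for name, s in SPLITS.items():
    share = s["num_examples"] / total_examples      # fraction of all examples
    avg_bytes = s["num_bytes"] / s["num_examples"]  # mean record size in bytes
    print(f"{name}: {share:.1%} of examples, ~{avg_bytes:.0f} bytes per example")

print(f"total: {total_examples} examples, {total_bytes / 1e9:.2f} GB uncompressed")
```

The two splits work out to roughly a 95/5 train/validation division, and the summed byte counts match the `dataset_size` field in the metadata.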
---
## Dataset Purpose
This dataset is intended for:
* Continued pretraining (DAPT / CPT)
* Language exposure before instruction tuning
* Improving Swahili fluency and coherence
* Reducing English dominance in multilingual models
* Low-resource language research
It is **not optimized for instruction-following by default**.
---
## Dataset Format
The dataset is provided in **JSON Lines (`.jsonl`)** format.
Each line contains a **single Swahili text sample**, without enforced instruction–response structure:
```json
{
"text": "Maudhui ya asili kwa Kiswahili yaliyoandikwa au kuzungumzwa katika mazingira halisi."
}
```
### Example
```json
{"text":"Nilikuwa najifunza kuhusu mitandao ya neva na jinsi inavyotumika katika utambuzi wa picha, lakini changamoto kubwa ilikuwa kupata data iliyosawazika."}
```
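A record in this format can be read with nothing more than the standard library. The following is a minimal sketch, assuming one JSON object per line with a string `text` field as described above; the helper name `iter_texts` and the in-memory sample sentences are made up for illustration and are not taken from the dataset.

```python
import io
import json
from typing import Iterator, TextIO


def iter_texts(lines: TextIO) -> Iterator[str]:
    """Yield the "text" field of each non-empty JSON Lines record."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record.get("text")
        if not isinstance(text, str):
            raise ValueError(f"line {line_no}: missing or non-string 'text' field")
        yield text


# Demo on an in-memory file with invented sample sentences; a real file
# would be opened with open(path, encoding="utf-8") instead.
sample = io.StringIO(
    '{"text": "Habari ya asubuhi, karibu darasani."}\n'
    "\n"
    '{"text": "Leo tutajifunza kuhusu mitandao ya neva."}\n'
)
texts = list(iter_texts(sample))
print(len(texts), "records read")
```

Reading lazily through a generator keeps memory use flat, which matters for a corpus of this size.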
---
## Supported Tasks
* Language modeling
* Text generation
* Conversational fluency improvement
* Vocabulary and grammar learning
* Domain adaptation for Swahili
* Foundation training for downstream fine-tuning
---
## Language
* **Swahili (Kiswahili)** — primary language (`sw`)
* Natural code-switching with English technical terms may appear
* Informal and semi-formal registers are both present
---
## Data Sources
The dataset was curated from:
* Educational explanations and mentoring sessions
* Informal dialogues and narrative text
* Mixed-domain Swahili content reflecting real usage
All data has been anonymized and cleaned.
---
## Data Preprocessing
The following preprocessing steps were applied:
* Filtering of extremely short or low-quality text
* Normalization of whitespace and encoding
* Preservation of natural language flow and code-switching
No artificial instruction templates were added.
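The card does not publish the actual cleaning pipeline, so the following is only a rough sketch of what steps like these could look like. The `MIN_CHARS` threshold, the choice of NFC normalization, and the function names are all assumptions made for illustration; the real filters may differ, for example by keeping paragraph breaks instead of collapsing every whitespace run.

```python
import re
import unicodedata
from typing import Iterable, Iterator

MIN_CHARS = 20  # hypothetical threshold; the card does not state the real one

_WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Apply NFC Unicode normalization and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return _WHITESPACE.sub(" ", text).strip()


def clean(texts: Iterable[str], min_chars: int = MIN_CHARS) -> Iterator[str]:
    """Normalize each sample and drop those shorter than min_chars."""
    for raw in texts:
        text = normalize(raw)
        if len(text) >= min_chars:
            yield text


# Invented samples: the first is too short, the second mixes in English terms.
raw_samples = [
    "  Sawa. ",
    "Nilikuwa   najifunza\tkuhusu machine learning leo asubuhi.\n",
]
cleaned = list(clean(raw_samples))
print(cleaned)
```

Note that the sketch leaves the wording of each sample untouched, so code-switched English terms pass through unchanged, in line with the preservation step listed above.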
---
## Intended Use
### Primary Use
* Continued pretraining (CPT/DAPT) for Swahili
* Language exposure before instruction tuning
* Improving multilingual model balance
### Secondary Use
* Linguistic analysis
* Swahili NLP benchmarking
* Data augmentation for low-resource research
---
## Out-of-Scope Uses
This dataset is **not intended** for:
* Direct instruction fine-tuning (see companion instruct dataset)
* Preference learning or RLHF
* Surveillance, profiling, or harmful content generation
---
## Biases and Limitations
* Informal conversational Swahili is more common than formal prose
* Technical domains may be overrepresented
* Regional phrasing reflects contributor backgrounds
Users should consider complementary corpora for broader coverage.
---
## Ethical Considerations
* Personally identifiable information has been removed or masked
* No sensitive personal attributes are intentionally included
* Released strictly for responsible AI research
---
## License
This dataset is released under the **MIT License**.
---
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{swahili_language_exposure,
title = {swahili-language-exposure: A Swahili Language Exposure Dataset},
author = {NileAGI},
year = {2026},
publisher = {Hugging Face}
}
```
Provider:
nileagi



