nileagi/swahili-language-exposure

Name: nileagi/swahili-language-exposure
Creator: nileagi
Published: 2026-01-28 08:24:27
License: MIT

Source: Hugging Face. Updated 2026-01-28; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/nileagi/swahili-language-exposure

Description:

---
license: mit
language:
- sw
task_categories:
- text-generation
- question-answering
- feature-extraction
tags:
- swahili
- language-exposure
- pretraining
- low-resource-language
- african-languages
- conversational-text
pretty_name: Swahili Language Exposure (NileAGI)
size_categories: unknown
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3077404935
    num_examples: 1589831
  - name: validation
    num_bytes: 161180380
    num_examples: 83676
  download_size: 1936721770
  dataset_size: 3238585315
---

# swahili-language-exposure

## Dataset Summary

**swahili-language-exposure** is a large-scale Swahili (Kiswahili) corpus designed for **language exposure and continued pretraining** of language models.

Unlike instruction-tuning datasets, this dataset focuses on exposing models to **natural Swahili usage** across conversations, explanations, narratives, technical discussions, and mixed-domain text. The goal is to improve **fluency, vocabulary coverage, syntax, and cultural grounding** in Swahili.

This dataset is developed and maintained by **NileAGI**.

---

## Dataset Purpose

This dataset is intended for:

* Continued pretraining (DAPT / CPT)
* Language exposure before instruction tuning
* Improving Swahili fluency and coherence
* Reducing English dominance in multilingual models
* Low-resource language research

It is **not optimized for instruction-following by default**.

---

## Dataset Format

The dataset is provided in **JSON Lines (`.jsonl`)** format.

Each line contains a **single Swahili text sample**, without enforced instruction-response structure:

```json
{
  "text": "Maudhui ya asili kwa Kiswahili yaliyoandikwa au kuzungumzwa katika mazingira halisi."
}
```

### Example

```json
{"text":"Nilikuwa najifunza kuhusu mitandao ya neva na jinsi inavyotumika katika utambuzi wa picha, lakini changamoto kubwa ilikuwa kupata data iliyosawazika."}
```

---

## Supported Tasks

* Language modeling
* Text generation
* Conversational fluency improvement
* Vocabulary and grammar learning
* Domain adaptation for Swahili
* Foundation training for downstream fine-tuning

---

## Language

* **Swahili (Kiswahili)**: primary language (`sw`)
* Natural code-switching with English technical terms may appear
* Informal and semi-formal registers are both present

---

## Data Sources

The dataset was curated from:

* Educational explanations and mentoring sessions
* Informal dialogues and narrative text
* Mixed-domain Swahili content reflecting real usage

All data has been anonymized and cleaned.

---

## Data Preprocessing

The following preprocessing steps were applied:

* Filtering of extremely short or low-quality text
* Normalization of whitespace and encoding
* Preservation of natural language flow and code-switching

No artificial instruction templates were added.

---

## Intended Use

### Primary Use

* Continued pretraining (CPT/DAPT) for Swahili
* Language exposure before instruction tuning
* Improving multilingual model balance

### Secondary Use

* Linguistic analysis
* Swahili NLP benchmarking
* Data augmentation for low-resource research

---

## Out-of-Scope Uses

This dataset is **not intended** for:

* Direct instruction fine-tuning (see companion instruct dataset)
* Preference learning or RLHF
* Surveillance, profiling, or harmful content generation

---

## Biases and Limitations

* Informal conversational Swahili is more common than formal prose
* Technical domains may be overrepresented
* Regional phrasing reflects contributor backgrounds

Users should consider complementary corpora for broader coverage.

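The steps listed under Data Preprocessing (length filtering, whitespace and encoding normalization) could look roughly like the sketch below. This is not NileAGI's actual pipeline: the `MIN_CHARS` threshold, the choice of NFC normalization, and the function names are all assumptions made for illustration, and the "low-quality" filter the card mentions is not modeled at all.

```python
import unicodedata

# Hypothetical cutoff; the card does not state the real minimum length.
MIN_CHARS = 20


def normalize(text: str) -> str:
    """Normalize Unicode to NFC and collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def preprocess(samples):
    """Yield cleaned samples, skipping any shorter than MIN_CHARS after cleaning."""
    for raw in samples:
        cleaned = normalize(raw)
        if len(cleaned) >= MIN_CHARS:
            yield cleaned


cleaned = list(preprocess(["  Habari   za\tasubuhi,  rafiki   yangu?  ", "Sawa"]))
```

Nothing in this sketch filters by language or script, which is consistent with the card's stated aim of preserving code-switched English terms inside Swahili text.
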

---

## Ethical Considerations

* Personally identifiable information has been removed or masked
* No sensitive personal attributes are intentionally included
* Released strictly for responsible AI research

---

## License

This dataset is released under the **MIT License**.

---

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{swahili_language_exposure,
  title     = {swahili-language-exposure: A Swahili Language Exposure Dataset},
  author    = {NileAGI},
  year      = {2026},
  publisher = {Hugging Face}
}
```
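As a minimal sketch of consuming the format described under Dataset Format, the snippet below reads JSON Lines records with Python's standard library and extracts the `text` field. The helper name `iter_texts` is hypothetical, and the sample record is copied from the card's Example section.

```python
import json


def iter_texts(lines):
    """Yield the "text" field of each non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["text"]


# Example record copied from the card's Example section.
sample_lines = [
    '{"text":"Nilikuwa najifunza kuhusu mitandao ya neva na jinsi inavyotumika '
    'katika utambuzi wa picha, lakini changamoto kubwa ilikuwa kupata data iliyosawazika."}',
]

texts = list(iter_texts(sample_lines))
```

Any iterable of lines works, so an open file handle (opened with `encoding="utf-8"`) can be passed directly. Note that the card's metadata lists shard paths of the form `data/train-*` and `data/validation-*`, so the files actually distributed may not be a single `.jsonl` file.
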
