ngusadeep/Swahili-Corpus-Dataset

Name: ngusadeep/Swahili-Corpus-Dataset
Creator: ngusadeep
Published: 2026-01-06 07:36:57
License: Not specified in the listing (the dataset card below declares Apache-2.0)

Source: Hugging Face. Updated: 2026-01-06. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ngusadeep/Swahili-Corpus-Dataset


Description:

---
dataset_info:
  features:
    - name: text
      dtype: string
  splits:
    - name: train
      num_examples: 1693227  # total lines in all shards
  download_size: null
  dataset_size: null
configs:
  - config_name: default
    data_files:
      - split: train
        path: Swahili_Corpus_combined.txt
license: apache-2.0
language:
  - sw
pretty_name: Swahili Corpus Dataset
size_categories:
  - 1M<n<10M
task_categories:
  - text-generation
  - feature-extraction
tags:
  - swahili
  - kiswahili
  - africa
  - low-resource-language
  - llm-pretraining
  - text-corpus
dataset_source: mendeley
dataset_type: raw_text
---

# Swahili Corpus Dataset

**A large-scale Swahili text corpus for language model pretraining and NLP research.**

The **Swahili Corpus Dataset** is a large-scale collection of Swahili (Kiswahili) text designed to support Natural Language Processing (NLP) research and the development of large language models (LLMs) for a low-resource African language.

This dataset contains approximately **1.69 million Swahili text samples** aggregated from public and official sources. It is suitable for **LLM pretraining, continual pretraining, tokenizer training, embeddings, fine-tuning, and Retrieval-Augmented Generation (RAG)**.

This Hugging Face release is a **curated and unified version** derived from the original **Swahili Corpus** published on **Mendeley Data** by **Noel Masasi** and **Bernard Masua (2024)**.

## Dataset Summary

The Swahili Corpus Dataset aggregates real-world Swahili text from multiple thematic domains, including government, health, education, agriculture, law, and news. The corpus is **unannotated** and optimized for **unsupervised learning**, making it ideal for training and adapting foundation language models to Kiswahili.

The goal of this dataset is to strengthen Swahili representation in modern NLP systems and to support African language AI research.

## Loading the Dataset

You can load the dataset using 🤗 Datasets:

```python
from datasets import load_dataset

ds = load_dataset("ngusadeep/Swahili-Corpus-Dataset", split="train")

# Access first example
print(ds[0]["text"])
```

## Dataset Structure

### Data Fields

Each example contains a single field:

| Field  | Type   | Description             |
| ------ | ------ | ----------------------- |
| `text` | string | Raw Swahili text sample |

### Example

```python
{
    "text": "Serikali ya Jamhuri ya Muungano wa Tanzania imetoa taarifa rasmi kuhusu maendeleo ya sekta ya afya..."
}
```

There are **no labels, annotations, or metadata fields**, making this dataset suitable for large-scale unsupervised training.

## Original Corpus Categories

The original Swahili Corpus was organized into the following thematic categories, which were merged during preprocessing:

* **AFYA** – Health
* **BIASHARA** – Business & Industry
* **BUNGE** – Parliamentary records
* **DINI** – Religion
* **ELIMU** – Education
* **HABARI** – News
* **KILIMO** – Agriculture
* **MITANDAO** – Social Media
* **MASHIRIKA YA KIRAIA** – Civil Society / NGOs
* **SERIKALI** – Government
* **SHERIA** – Legal documents
* **SIASA** – Politics

A combined corpus file is also provided in the original release.

## Use Cases

* **LLM Pretraining** (LLaMA, Gemma, Qwen, Mistral)
* **Continual pretraining** for Swahili adaptation
* **Tokenizer training** for Kiswahili
* **Text generation**
* **Embedding models**
* **RAG (Retrieval-Augmented Generation)**
* **Linguistic and computational language research**
* **Low-resource language AI development**

## Data Collection & Processing (Original Authors)

According to the original publication, the dataset was created through the following steps:

1. Identification of Swahili content categories
2. Collection of documents from public and official sources
3. Downloading files in PDF and DOCX formats
4. Text extraction using Python scripts
5. Cleaning, normalization, and merging
6. Generation of corpus statistics

## Dataset Statistics

* **Total samples:** ~1,690,000
* **Splits:** `train` only
* **Language:** Swahili
* **Annotation:** None (raw text)
* **Domain coverage:** Government, news, health, education, agriculture, law, politics

## Limitations

* Dataset is **unannotated**
* May contain OCR artifacts or formatting noise
* Domain bias toward formal and government-related text
* Not guaranteed to be free of sensitive or outdated information

Users are encouraged to apply additional cleaning, filtering, or deduplication for production use.

## Citation

If you use this dataset, please cite **both the Hugging Face release and the original source**.

### Hugging Face Release

```bibtex
@dataset{ngusadeep_swahili_corpus_2025,
  title     = {Swahili Corpus Dataset},
  author    = {Ngusa, Samwel},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset},
  note      = {Curated, processed, and released on Hugging Face}
}
```

### Original Dataset

```bibtex
@dataset{masasi2024swahili,
  title     = {Swahili Corpus},
  author    = {Masasi, Noel and Masua, Bernard},
  year      = {2024},
  publisher = {Mendeley Data},
  version   = {2},
  doi       = {10.17632/d4yhn5b9n6.2}
}
```

## License

* **Hugging Face Version:** Apache License 2.0
* **Original Dataset:** Creative Commons Attribution 4.0 (CC BY 4.0)

## Acknowledgments

This dataset is based on the **Swahili Corpus** originally created and published by **Noel Masasi** and **Bernard Masua (2024)** on **Mendeley Data**.

The Hugging Face release was **curated, processed, unified, and documented** by **Samwel Ngusa (ngusadeep)** to support large-scale NLP and LLM pretraining for Kiswahili.

## Maintainer

* **Samwel Ngusa** (`ngusadeep`)

For questions, issues, or improvements, please open a discussion on the dataset page:
[https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset](https://huggingface.co/datasets/ngusadeep/Swahili-Corpus-Dataset)

🇹🇿 **Advancing Swahili NLP and African Language AI**
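
## Example: Cleaning and Deduplication

The Limitations section encourages additional cleaning, filtering, or deduplication before production use. The following is a minimal sketch of one way to do that with the Python standard library only. The helper names `normalize_text` and `deduplicate`, and the `min_chars` threshold of 20, are assumptions made for illustration; they are not part of this dataset, of 🤗 Datasets, or of the original authors' pipeline. It removes only exact duplicates (after normalization), not near-duplicates.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return _WHITESPACE.sub(" ", text).strip()


def deduplicate(samples: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield cleaned samples, skipping very short ones and exact duplicates.

    Duplicates are detected case-insensitively on the normalized text.
    `min_chars` is an arbitrary illustrative threshold (an assumption),
    not a value recommended by the dataset authors.
    """
    seen = set()
    for raw in samples:
        cleaned = normalize_text(raw)
        if len(cleaned) < min_chars:
            continue  # likely a fragment or formatting noise
        # Hash the case-folded text so memory use stays small on large corpora.
        digest = hashlib.sha1(cleaned.casefold().encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield cleaned


if __name__ == "__main__":
    demo = [
        "Serikali  ya Jamhuri ya Muungano wa Tanzania",
        "serikali ya jamhuri ya muungano wa tanzania ",
        "fupi",
    ]
    for line in deduplicate(demo):
        print(line)
```

With the dataset loaded as `ds`, something like `deduplicate(row["text"] for row in ds)` would stream cleaned samples without holding the full corpus in memory twice. Near-duplicate detection (for example MinHash) would need a separate approach.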
