AlexLeoTz/swahili_large_corpus_i

Name: AlexLeoTz/swahili_large_corpus_i
Creator: AlexLeoTz
Published: 2026-05-12 09:22:39
License: No description provided

Hugging Face | Updated: 2026-05-12 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/AlexLeoTz/swahili_large_corpus_i
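
As a minimal sketch of pulling a few rows from the repository above, the helper below streams documents with the Hugging Face `datasets` library instead of downloading every shard. It assumes that the library is installed, that the split is exposed under the name `train`, and that each row carries the `text` field described in the dataset card. The `loader` parameter is a hypothetical hook, not part of any library, added only so the function can be exercised without network access. To go through the mirror rather than huggingface.co, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing the library is the usual approach.

```python
from itertools import islice

REPO_ID = "AlexLeoTz/swahili_large_corpus_i"


def first_texts(n, split="train", loader=None):
    """Return the `text` field of the first n rows of the given split.

    `loader` is a hypothetical injection point for testing; by default the
    real `datasets.load_dataset` function is used in streaming mode.
    """
    if loader is None:
        # Imported lazily so the module loads even without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True yields rows lazily instead of materialising all shards.
    stream = loader(REPO_ID, split=split, streaming=True)
    # Assumption: rows are dicts with a `text` key, as the card's schema states.
    return [row["text"] for row in islice(stream, n)]
```

Calling `first_texts(3)` would then return the first three documents of the training split, provided the split and field names match these assumptions.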

Description:

---
license: other
language:
- sw
- en
pretty_name: Swahili Large Corpus (v1)
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- fill-mask
tags:
- swahili
- pretraining
- nlp
- african-languages
---

# Swahili Large Corpus (v1) - 5.75 Billion Tokens

## Overview

The **Swahili Large Corpus (v1)** is one of the largest and most diverse open pretraining datasets for the Swahili language. It was built for training compute-optimal large language models (LLMs), combining multiple Swahili and multilingual sources into a single, exactly deduplicated Parquet dataset.

Under the **Chinchilla scaling rule of thumb** (`D = 20N`, where `D` is the number of training tokens and `N` is the number of model parameters), this corpus suits a compute-optimal model of approximately **287 million parameters**.

All data in this corpus has been globally deduplicated using exact SHA-256 content hashing, which removes identical documents across all sources.

---

## Real-World Implementation: Zuhura 289M

This dataset served as the primary pre-training foundation for **Zuhura 289M**, a Swahili-centric LLM.

**Training Metrics (Zuhura 289M Base):**

- **Total Tokens Processed:** ~5 Billion
- **Hardware:** Single NVIDIA A100 40GB
- **Total Steps:** 37,400
- **Final Pre-training Loss:** **2.723**
- **Epoch:** ~1.0

The Zuhura 289M training run supports the quality and diversity of this corpus for building fluent, context-aware Swahili language models.
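
The Chinchilla rule of thumb quoted in the Overview (`D = 20N`) is plain arithmetic, and the short sketch below reproduces it for the token budgets mentioned in this card. The function names are illustrative and not part of any library. The fixed ratio of 20 is the approximation the card itself uses, not a full scaling-law fit.

```python
TOKENS_PER_PARAM = 20  # ratio quoted in the card: D = 20N


def optimal_params(tokens: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Compute-optimal parameter count N = D / ratio for a token budget D."""
    return tokens // ratio


def tokens_needed(params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Token budget D = ratio * N for a model with N parameters."""
    return params * ratio


# Token budgets taken from the card.
budgets = {
    "JamiiForums only": 1_317_000_000,
    "This corpus (v1)": 5_751_000_000,
    "Projected after full scrape": 20_000_000_000,
}

for name, tokens in budgets.items():
    print(f"{name}: {optimal_params(tokens):,} parameters")

# Going the other way: tokens a 289M-parameter model would need at 20x.
print(f"289M model: {tokens_needed(289_000_000):,} tokens")
```

By the same rule, a 289M-parameter model such as Zuhura 289M corresponds to roughly 5.78B tokens, which is close to the size of this corpus.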

---

## Dataset Specifications

- **Total Tokens:** ~5.751 Billion (GPT-2 tokenizer)
- **Total Unique Documents:** 52,713,713
- **Average Tokens per Document:** ~109
- **Format:** Optimized Parquet shards
- **Primary Language:** Swahili (sw)
- **Secondary Languages:** English (en) and Python code
- **Split:** Train (90%) / Validation (5%) / Test (5%)

---

## Data Sources

The dataset is a mixture of the following sources, all globally deduplicated after merging:

### Swahili Forum Discussions (1.317B tokens, original contribution)

Web-scraped long-form discussions from JamiiForums, the largest Swahili-speaking online community platform. Covers politics, culture, religion, health, sports, and everyday life. This is the original contribution of this dataset, collected and cleaned by the dataset author. The raw scrape produced ~9B tokens, which after global exact deduplication yielded 1.317B clean, unique tokens across 1,144,358 documents.

### Alfaxad/Inkuba-Mono-Swahili

A Swahili monolingual pretraining corpus from the Inkuba project, focused on African language NLP.

### Mollel/swahili_pretrain_data

A community-contributed Swahili pretraining text dataset.

### Benjamin-png/swahili-normalized-corpus

A normalized Swahili text corpus with preprocessing applied for improved text quality.

### ngusadeep/Swahili-Corpus-Dataset

A general Swahili corpus dataset for NLP research and language model training.

### Adeptschneider/CiviVox-Swahili-text-corpus-v2.0

A Swahili text corpus derived from audio transcriptions and civic discourse content.

### mwitiderrick/swahili

A Swahili language dataset for pretraining and fine-tuning language models.

### community-datasets/swahili_news

A dataset of Swahili news articles covering local and international topics.

### wikimedia/wikipedia (20231101.sw)

The November 2023 Swahili Wikipedia snapshot. Provides factual, structured, and formal encyclopedic Swahili prose.

### karpathy/tinystories-gpt4-clean

A dataset of short, simple English stories generated with GPT-4. Included to improve narrative comprehension and creative language understanding across languages.

### HuggingFaceFW/fineweb-edu (sample-10BT, 1M rows sampled)

A high-quality filtered subset of English educational web content from the FineWeb project. One million rows were sampled; they are included to improve cross-lingual transfer and general reasoning.

### ajibawa-2023/Python-Code-Large (1M rows sampled)

Python source code from open-source repositories. One million rows were sampled; they are included to enhance logical reasoning and structured thinking capabilities.

---

## Processing Pipeline

1. **Ingestion**: Each source was loaded and normalized to a unified `text` field schema.
2. **Quality Filtering**: Documents shorter than 50 characters were removed.
3. **Global Exact Deduplication**: SHA-256 content hashing was applied across the entire combined corpus. Identical documents were removed regardless of which source they came from.
4. **Shuffle & Split**: A deterministic shuffle (seed=42) was followed by a 90/5/5 train/validation/test split.

---

## Statistical Summary

| Feature | Value |
| :--- | :--- |
| **Total Tokens** | ~5,751,000,000 |
| **Total Unique Documents** | 52,713,713 |
| **JamiiForums Contribution** | 1,144,358 docs / 1.317B tokens |
| **Average Chars per Token** | 2.33 |
| **Compute-Optimal Model Size** | ~287M parameters |
| **Chinchilla Ratio (D/N)** | 20x |

---

## Compute-Optimal Training Guide

Using the Chinchilla formula `N_opt = D / 20`:

| Tokens Available | Optimal Model Size |
| :--- | :--- |
| 1.317B (JamiiForums only) | ~66M parameters |
| 5.751B (This corpus, v1) | ~287M parameters |
| 20B (projected after full scrape) | ~1B parameters |

---

## Use Cases

- **Swahili LLM Pre-training**: Train a production-grade Swahili base model.
- **Cross-lingual Transfer**: Fine-tune English LLMs (e.g., Llama, Mistral) for Swahili.
- **NLP Research**: Sentiment analysis, NER, text classification, machine translation.
- **Benchmarking**: Evaluate Swahili language understanding in multilingual models.

---

## Licensing

This dataset is a compilation of multiple sources, each carrying its own license. Users of this dataset are responsible for complying with the terms of each individual source. The table below summarizes the known licenses for each source:

| Source | License |
| :--- | :--- |
| JamiiForums (original scrape) | CC BY-NC-SA 4.0 |
| Alfaxad/Inkuba-Mono-Swahili | CC BY-NC 4.0 |
| Mollel/swahili_pretrain_data | CC BY 4.0 |
| Benjamin-png/swahili-normalized-corpus | MIT |
| ngusadeep/Swahili-Corpus-Dataset | Apache 2.0 |
| Adeptschneider/CiviVox-Swahili-text-corpus-v2.0 | Apache 2.0 |
| mwitiderrick/swahili | Apache 2.0 |
| community-datasets/swahili_news | CC BY 4.0 |
| wikimedia/wikipedia (sw) | CC BY-SA 4.0 |
| karpathy/tinystories-gpt4-clean | CDLA-Sharing-1.0 |
| HuggingFaceFW/fineweb-edu | ODC-By 1.0 |
| ajibawa-2023/Python-Code-Large | Apache 2.0 |

**Note**: Because this corpus includes data under the non-commercial licenses CC BY-NC-SA 4.0 (the JamiiForums scrape) and CC BY-NC 4.0 (Alfaxad/Inkuba-Mono-Swahili), the compiled dataset as a whole should be treated as **non-commercial** unless you exclude those specific sources from your use. The CC BY-SA 4.0 content (Wikipedia) permits commercial use but carries share-alike terms.

---

## Contact

Curated by [AlexLeoTz](https://huggingface.co/AlexLeoTz). If you use this dataset in your research, please credit the author and link back to this repository.
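
For readers who want to reproduce the steps listed under Processing Pipeline on their own data, the following is a minimal in-memory sketch, not the author's actual code. The 50-character threshold, SHA-256 hashing, seed 42, and 90/5/5 split come from the card. Hashing the UTF-8 bytes of the raw text, shuffling with Python's `random` module, and the function name `build_splits` are assumptions made for illustration.

```python
import hashlib
import random

MIN_CHARS = 50  # from the card: documents shorter than 50 characters are removed
SEED = 42       # from the card: deterministic shuffle seed


def build_splits(docs, min_chars=MIN_CHARS, seed=SEED):
    """Filter, exact-deduplicate, shuffle, and split an iterable of strings."""
    seen = set()
    kept = []
    for text in docs:
        # Quality filter: drop very short documents.
        if len(text) < min_chars:
            continue
        # Exact dedup: identical content gives an identical SHA-256 digest.
        # Assumption: the hash is taken over the UTF-8 bytes of the raw text.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)

    # Deterministic shuffle: the same seed and input order give the same result.
    random.Random(seed).shuffle(kept)

    # 90/5/5 split using integer arithmetic; the remainder goes to test.
    n_train = len(kept) * 90 // 100
    n_val = len(kept) * 5 // 100
    return {
        "train": kept[:n_train],
        "validation": kept[n_train:n_train + n_val],
        "test": kept[n_train + n_val:],
    }
```

Exact hashing only removes byte-identical documents, so near-duplicates that differ by whitespace or punctuation survive this step. At the scale of this corpus the set of digests would likely need to be sharded or kept on disk rather than held in one Python set.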

Provided by:

AlexLeoTz
