AlexLeoTz/jamiiforums-2b

Name: AlexLeoTz/jamiiforums-2b
Creator: AlexLeoTz
Published: 2026-03-25 19:54:27
License: MIT (per the dataset card metadata)

Source: Hugging Face. Updated: 2026-03-25. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/AlexLeoTz/jamiiforums-2b


Description:

---
license: mit
language:
- sw
pretty_name: JamiiForums 1.26B (Swahili ONLY) - Compute Optimal
size_categories:
- 100K<n<1M
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: url
    dtype: large_string
  - name: title
    dtype: large_string
  - name: content
    dtype: large_string
  splits:
  - name: train
    num_bytes: 2769300260
    num_examples: 881373
  - name: validation
    num_bytes: 153849490
    num_examples: 48965
  - name: test
    num_bytes: 153852632
    num_examples: 48966
  download_size: 1985863081
  dataset_size: 3077002382
---

# JamiiForums 1.26B (Swahili ONLY) - Compute Optimal

This is a **deduplicated, high-quality, Swahili-only corpus**, presented by its creator as the largest of its kind. It is designed for pre-training a **compute-optimal language model of approximately 63 million parameters**, the size the Chinchilla scaling laws (roughly 20 training tokens per parameter) suggest for a 1.26B-token corpus. All bit-for-bit identical content has been removed to improve training quality.

## Dataset Status

- **Unique Items**: 979,304
- **Cleaned Tokens**: ~1.262 billion
- **Deduplication Method**: Exact content hash (Python/Pandas)
- **Reduction**: ~58% of the original raw items were duplicates

## Training Splits

- **Train**: 881,373 rows
- **Validation**: 48,965 rows
- **Test**: 48,966 rows

## Usage

```python
from datasets import load_dataset

# Stream the train split instead of downloading the full ~2 GB archive first.
ds = load_dataset("AlexLeoTz/jamiiforums-2b", split="train", streaming=True)
```
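The card names "Exact Content Hash (Python/Pandas)" as the deduplication method but does not publish the script. The following is only a minimal sketch of the general idea, using the standard library rather than Pandas. The choice of SHA-256, the helper names `content_hash` and `drop_exact_duplicates`, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 bytes of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(rows):
    """Keep the first row for each distinct `content` value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        digest = content_hash(row["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Hypothetical rows that follow the card's url/title/content schema.
rows = [
    {"url": "https://example.com/a", "title": "A", "content": "Habari za leo"},
    {"url": "https://example.com/b", "title": "B", "content": "Habari za leo"},
    {"url": "https://example.com/c", "title": "C", "content": "Habari za leo "},
]

unique_rows = drop_exact_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```

Because the comparison is bit-for-bit, the third row survives even though it differs from the first only by a trailing space. Near-duplicates of that kind would need normalization or fuzzy matching, which the card does not claim to perform.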
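The 63-million-parameter figure and the split sizes quoted in the card can be checked with a few lines of arithmetic. This sketch assumes the common Chinchilla rule of thumb of about 20 training tokens per parameter; the card cites Chinchilla but does not state the ratio it used, so `TOKENS_PER_PARAM` is an assumption.

```python
# Assumed rule of thumb from the Chinchilla scaling laws; not stated in the card.
TOKENS_PER_PARAM = 20

# Token count taken from the card's title (1.26B).
corpus_tokens = 1.26e9
optimal_params = corpus_tokens / TOKENS_PER_PARAM

# Row counts taken from the card's split table.
train_rows, validation_rows, test_rows = 881_373, 48_965, 48_966
total_rows = train_rows + validation_rows + test_rows
train_fraction = train_rows / total_rows

print(f"Compute-optimal parameters: {optimal_params:,.0f}")
print(f"Total rows: {total_rows:,}")
print(f"Train share: {train_fraction:.1%}")
```

Under that assumption the corpus supports a model of about 63 million parameters, the split counts sum to the 979,304 unique items listed in the card, and the splits work out to roughly 90/5/5.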

Provider:

AlexLeoTz
