
AlexLeoTz/jamiiforums-2b

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AlexLeoTz/jamiiforums-2b
Description:
---
license: mit
language:
- sw
pretty_name: JamiiForums 1.26B (Swahili ONLY) - Compute Optimal
size_categories:
- 100K<n<1M
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: url
    dtype: large_string
  - name: title
    dtype: large_string
  - name: content
    dtype: large_string
  splits:
  - name: train
    num_bytes: 2769300260
    num_examples: 881373
  - name: validation
    num_bytes: 153849490
    num_examples: 48965
  - name: test
    num_bytes: 153852632
    num_examples: 48966
  download_size: 1985863081
  dataset_size: 3077002382
---

# JamiiForums 1.26B (Swahili ONLY) - Compute Optimal

This is a **duplication-free, high-quality, largest Swahili ONLY corpus**. It is specifically designed for pre-training a **compute-optimal language model of approximately 63 million parameters** (following Chinchilla scaling laws for a 1.26B token corpus). All bit-for-bit identical content has been removed to ensure high training quality.

## Dataset Status

- **Unique Items**: 979,304
- **Cleaned Tokens**: ~1.262 Billion Tokens.
- **Deduplication Method**: Exact Content Hash (Python/Pandas).
- **Reduction**: ~58% of original raw items were duplicates.

## Training Splits

- **Train**: 881,373 rows.
- **Validation**: 48,965 rows.
- **Test**: 48,966 rows.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("AlexLeoTz/jamiiforums-2b", split="train", streaming=True)
```
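The sizing and split figures in the card can be sanity-checked with a few lines of arithmetic. The sketch below assumes the common rule-of-thumb reading of the Chinchilla result, roughly 20 training tokens per parameter; the card only states the resulting ~63M figure, so that ratio is an assumption, and the names `TOKENS_PER_PARAM` and `compute_optimal_params` are illustrative.

```python
# Figures copied from the dataset card.
TOKENS = 1.262e9  # "~1.262 Billion Tokens"
SPLITS = {"train": 881_373, "validation": 48_965, "test": 48_966}
UNIQUE_ITEMS = 979_304

# Assumption: about 20 tokens per parameter, a rule-of-thumb reading of Chinchilla.
TOKENS_PER_PARAM = 20


def compute_optimal_params(tokens, tokens_per_param=TOKENS_PER_PARAM):
    """Estimate the parameter count that a given token budget can train optimally."""
    return tokens / tokens_per_param


params = compute_optimal_params(TOKENS)
total_rows = sum(SPLITS.values())
fractions = {name: rows / total_rows for name, rows in SPLITS.items()}

print(f"approx. compute-optimal parameters: {params / 1e6:.1f}M")
print(f"rows across splits: {total_rows:,} (card reports {UNIQUE_ITEMS:,})")
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

Under that assumption the estimate lands close to the card's "approximately 63 million parameters", and the three splits add up to the reported unique-item count in a roughly 90/5/5 ratio.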
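The card names the deduplication method as an exact content hash done with Python/Pandas but does not show the code. A minimal standard-library sketch of the same idea follows; the choice of SHA-256, hashing the `content` field (a name taken from the card's feature list), and the helper name `dedupe_exact` are all assumptions rather than the author's actual pipeline.

```python
import hashlib


def dedupe_exact(rows, key="content"):
    """Keep the first row for each distinct value of row[key], compared byte for byte."""
    seen = set()
    unique = []
    for row in rows:
        # Hash the UTF-8 bytes so that only bit-for-bit identical text collides.
        digest = hashlib.sha256(row[key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Hypothetical rows shaped like the card's url/title/content features.
sample = [
    {"url": "https://example.com/a", "title": "A", "content": "Habari za asubuhi"},
    {"url": "https://example.com/b", "title": "B", "content": "Habari za asubuhi"},
    {"url": "https://example.com/c", "title": "C", "content": "Habari za asubuhi "},
]
unique_rows = dedupe_exact(sample)
print(len(unique_rows))
```

Because the comparison is exact, near-duplicates that differ only by whitespace or punctuation (such as the third sample row, which has a trailing space) are kept.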
Provider:
AlexLeoTz