AlexLeoTz/jamiiforums-2b
Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AlexLeoTz/jamiiforums-2b
Description:
---
license: mit
language:
- sw
pretty_name: JamiiForums 1.26B (Swahili ONLY) - Compute Optimal
size_categories:
- 100K<n<1M
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: url
    dtype: large_string
  - name: title
    dtype: large_string
  - name: content
    dtype: large_string
  splits:
  - name: train
    num_bytes: 2769300260
    num_examples: 881373
  - name: validation
    num_bytes: 153849490
    num_examples: 48965
  - name: test
    num_bytes: 153852632
    num_examples: 48966
  download_size: 1985863081
  dataset_size: 3077002382
---
# JamiiForums 1.26B (Swahili ONLY) - Compute Optimal
This is a **deduplicated, Swahili-only corpus**, which the author describes as high-quality and the largest of its kind. It is designed for pre-training a **compute-optimal language model of approximately 63 million parameters**, the size that Chinchilla scaling laws (roughly 20 training tokens per parameter) suggest for a 1.26B-token corpus. All bit-for-bit identical content has been removed to improve training quality.
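
The 63M figure can be reproduced with the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter. The sketch below assumes that 20:1 ratio (the card does not state the exact ratio used), and `compute_optimal_params` is a hypothetical helper name.

```python
TOKENS_PER_PARAM = 20  # assumed rule-of-thumb ratio, not stated in the card


def compute_optimal_params(num_tokens: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the approximate compute-optimal parameter count for a token budget."""
    return num_tokens / tokens_per_param


corpus_tokens = 1.26e9  # ~1.26B tokens, from the card
params = compute_optimal_params(corpus_tokens)
print(f"{params / 1e6:.0f}M parameters")
```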
## Dataset Status
- **Unique Items**: 979,304
- **Cleaned Tokens**: ~1.262 billion.
- **Deduplication Method**: Exact Content Hash (Python/Pandas).
- **Reduction**: ~58% of the original raw items were duplicates and were removed.
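
Exact-content-hash deduplication can be sketched as follows. This is a plain-Python illustration of the idea, not the author's Pandas code. The SHA-256 hash, hashing only the `content` field, keeping the first occurrence, and the sample rows are all assumptions made for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the exact bytes of the text; any difference yields a different hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(rows: list[dict]) -> list[dict]:
    """Keep the first row for each distinct content hash, preserving order."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        digest = content_hash(row["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Hypothetical sample rows using the card's column names.
rows = [
    {"url": "u1", "title": "t1", "content": "Habari za asubuhi"},
    {"url": "u2", "title": "t2", "content": "Habari za asubuhi"},   # exact duplicate, dropped
    {"url": "u3", "title": "t3", "content": "Habari za asubuhi "},  # trailing space, kept
]
unique_rows = drop_exact_duplicates(rows)
print(len(unique_rows))
```

Because the comparison is bit-for-bit, rows that differ only in whitespace or punctuation count as distinct, as the third sample row shows.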
## Training Splits
- **Train**: 881,373 rows.
- **Validation**: 48,965 rows.
- **Test**: 48,966 rows.
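
As a quick consistency check, the split sizes add up to the 979,304 unique items reported above and correspond to roughly a 90/5/5 partition. The snippet below only restates the card's row counts. The variable names are illustrative.

```python
# Row counts copied from the card.
splits = {"train": 881_373, "validation": 48_965, "test": 48_966}

total = sum(splits.values())
fractions = {name: rows / total for name, rows in splits.items()}

print(f"total rows: {total:,}")
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```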
## Usage
```python
from datasets import load_dataset

# Stream the train split rather than downloading the whole dataset up front.
ds = load_dataset("AlexLeoTz/jamiiforums-2b", split="train", streaming=True)
```
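
Each record carries the three string fields listed in the card's metadata: `url`, `title`, and `content`. The sketch below shows one possible way to turn a record into a single training string. The `format_example` and `preview` helpers and the title-then-content layout are illustrative assumptions, not part of the dataset.

```python
def format_example(record: dict) -> str:
    """Join title and content into one text block (hypothetical layout)."""
    title = (record.get("title") or "").strip()
    content = (record.get("content") or "").strip()
    return f"{title}\n\n{content}" if title else content


def preview(n: int = 3) -> None:
    """Print the start of the first n streamed records.

    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset

    ds = load_dataset("AlexLeoTz/jamiiforums-2b", split="train", streaming=True)
    for record in ds.take(n):
        print(format_example(record)[:200])
        print("-" * 40)


# Hypothetical record, used here so the example runs offline.
sample = {"url": "https://example.invalid/thread", "title": " Kichwa ", "content": "Maandishi\n"}
print(format_example(sample))
```

Calling `preview()` streams a few real records without downloading the full dataset.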
Provider:
AlexLeoTz



