NaolBM/african-corpus
Source: Hugging Face · Updated 2026-02-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/NaolBM/african-corpus
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 10537058711
    num_examples: 35417771
  download_size: 5053335344
  dataset_size: 10537058711
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Comprehensive African Languages Dataset
## 📊 Dataset Overview
- **Total rows:** 35,417,771
- **Languages:** 6 African languages (Swahili, Amharic, Hausa, Oromo, Yoruba, Tigrinya) + English
- **Features:** `text` | `language`
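
With roughly 35 million rows and a download of about 5 GB (per `download_size` in the metadata), it may be convenient to peek at the data without fetching everything first. The following is a minimal sketch, assuming the `streaming=True` mode of `datasets.load_dataset`; the helper names `first_n_for_language` and `stream_corpus` are hypothetical and not part of the dataset or the library.

```python
from itertools import islice
from typing import Dict, Iterable, List


def first_n_for_language(
    rows: Iterable[Dict[str, str]], language: str, n: int
) -> List[Dict[str, str]]:
    """Return the first n rows whose `language` field equals `language`.

    Works on any iterable of {"text": ..., "language": ...} dicts,
    including a streamed dataset, and stops reading once n rows are found.
    """
    matching = (row for row in rows if row["language"] == language)
    return list(islice(matching, n))


def stream_corpus():
    """Open the train split lazily (hypothetical helper).

    Data is fetched only as the returned object is iterated.
    """
    # Imported here so the pure-Python helper above has no extra dependency.
    from datasets import load_dataset

    return load_dataset("NaolBM/african-corpus", split="train", streaming=True)


# Example (requires network access and the `datasets` package):
#   for row in first_n_for_language(stream_corpus(), "sw", 3):
#       print(row["language"], row["text"][:80])
```

Note that rows for a rare language may sit far into the stream, so a lookup for such a language could scan a large part of the corpus before returning.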
## 🌍 Language Distribution
| Language | Count | Percentage | Distribution |
|----------|------------|------------|--------------------------------|
| sw | 14,127,076 | 39.89% | ███████████░░░░░░░░░░░░░░░░░░░ |
| am | 10,815,255 | 30.54% | █████████░░░░░░░░░░░░░░░░░░░░░ |
| ha | 7,180,569 | 20.27% | ██████░░░░░░░░░░░░░░░░░░░░░░░░ |
| en | 2,119,719 | 5.98% | █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| om | 883,420 | 2.49% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| yo | 279,656 | 0.79% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| ti | 12,076 | 0.03% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| **TOTAL** | **35,417,771** | **100.00%** | |
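
The percentages in the table follow directly from the per-language row counts. As a small sanity check, the sketch below recomputes the total and each share from the counts copied out of the table; the function name `language_shares` is hypothetical.

```python
# Row counts copied from the Language Distribution table.
COUNTS = {
    "sw": 14_127_076,
    "am": 10_815_255,
    "ha": 7_180_569,
    "en": 2_119_719,
    "om": 883_420,
    "yo": 279_656,
    "ti": 12_076,
}


def language_shares(counts):
    """Return (total rows, {language: percentage rounded to 2 decimals})."""
    total = sum(counts.values())
    shares = {lang: round(100 * n / total, 2) for lang, n in counts.items()}
    return total, shares


total, shares = language_shares(COUNTS)
print(f"total rows: {total:,}")
for lang, pct in shares.items():
    print(f"{lang}: {pct:.2f}%")
```

The same function could be applied to counts obtained from the loaded dataset (for example via `collections.Counter` over the `language` column) to confirm that a local copy matches the card.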
## 📝 Dataset Sources
| Source | Language | Rows |
|--------|----------|------|
| amharic-combined | am | 10,056,352 |
| bible | am | 30,752 |
| wikipedia | am | 13,906 |
| amharic_corpus | am | 707,649 |
| tinystories | en | 2,119,719 |
| afaanoromoo | om | 410,841 |
| oromo_wiki | om | 1,970 |
| tigrinya | ti | 12,076 |
| amharic_books | am | 6,596 |
| oromo_name | om | 28,426 |
| hausa | ha | 1,282,997 |
| swahili | sw | 1,442,912 |
| yoruba | yo | 149,148 |
| swahili_news | sw | 22,207 |
| swahili_corpus | sw | 12,660,806 |
| hausa_translation | ha | 5,861,080 |
| yoruba_synth | yo | 20,156 |
| cc100_yoruba | yo | 76,533 |
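
To roll the per-source rows up by language, a short aggregation like the one below may help. It is only a sketch: the tuples are copied from the Dataset Sources table, and the function name `rows_per_language` is hypothetical. The resulting totals can then be compared against the Language Distribution table.

```python
from collections import defaultdict

# (source, language, rows) copied from the Dataset Sources table.
SOURCES = [
    ("amharic-combined", "am", 10_056_352),
    ("bible", "am", 30_752),
    ("wikipedia", "am", 13_906),
    ("amharic_corpus", "am", 707_649),
    ("tinystories", "en", 2_119_719),
    ("afaanoromoo", "om", 410_841),
    ("oromo_wiki", "om", 1_970),
    ("tigrinya", "ti", 12_076),
    ("amharic_books", "am", 6_596),
    ("oromo_name", "om", 28_426),
    ("hausa", "ha", 1_282_997),
    ("swahili", "sw", 1_442_912),
    ("yoruba", "yo", 149_148),
    ("swahili_news", "sw", 22_207),
    ("swahili_corpus", "sw", 12_660_806),
    ("hausa_translation", "ha", 5_861_080),
    ("yoruba_synth", "yo", 20_156),
    ("cc100_yoruba", "yo", 76_533),
]


def rows_per_language(sources):
    """Sum the listed source rows for each language code."""
    totals = defaultdict(int)
    for _source, language, rows in sources:
        totals[language] += rows
    return dict(totals)


per_language = rows_per_language(SOURCES)
for language, rows in sorted(per_language.items(), key=lambda kv: -kv[1]):
    print(f"{language}: {rows:,}")
```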
## 🚀 Usage
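```python
from datasets import load_dataset

# Load the single "train" split as a Dataset.
# Without split="train", load_dataset returns a DatasetDict,
# which has no train_test_split method.
dataset = load_dataset("NaolBM/african-corpus", split="train")

# Access by language
amharic = dataset.filter(lambda x: x["language"] == "am")
swahili = dataset.filter(lambda x: x["language"] == "sw")

# Train/val split: hold out 1% of rows, reproducibly
splits = dataset.train_test_split(test_size=0.01, seed=42)
train_data = splits["train"]
val_data = splits["test"]
```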
Provider:
NaolBM



