NaolBM/african-corpus

Name: NaolBM/african-corpus
Creator: NaolBM
Published: 2026-02-23 10:25:31
License: No description provided

Source: Hugging Face · Updated: 2026-02-23 · Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/NaolBM/african-corpus

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 10537058711
    num_examples: 35417771
  download_size: 5053335344
  dataset_size: 10537058711
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Comprehensive African Languages Dataset

## 📊 Dataset Overview

- **Total rows:** 35,417,771
- **Languages:** 6 African languages + English
- **Features:** `text` | `language`

## 🌍 Language Distribution

| Language | Count | Percentage | Distribution |
|----------|------------|------------|--------------------------------|
| sw | 14,127,076 | 39.89% | ███████████░░░░░░░░░░░░░░░░░░░ |
| am | 10,815,255 | 30.54% | █████████░░░░░░░░░░░░░░░░░░░░░ |
| ha | 7,180,569 | 20.27% | ██████░░░░░░░░░░░░░░░░░░░░░░░░ |
| en | 2,119,719 | 5.98% | █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| om | 883,420 | 2.49% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| yo | 279,656 | 0.79% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| ti | 12,076 | 0.03% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| **TOTAL** | **35,417,771** | **100.00%** | |

## 📝 Dataset Sources

| Source | Language | Rows |
|--------|----------|------|
| amharic-combined | am | 10,056,352 |
| bible | am | 30,752 |
| wikipedia | am | 13,906 |
| amharic_corpus | am | 707,649 |
| tinystories | en | 2,119,719 |
| afaanoromoo | om | 410,841 |
| oromo_wiki | om | 1,970 |
| tigrinya | ti | 12,076 |
| amharic_books | am | 6,596 |
| oromo_name | om | 28,426 |
| hausa | ha | 1,282,997 |
| swahili | sw | 1,442,912 |
| yoruba | yo | 149,148 |
| swahili_news | sw | 22,207 |
| swahili_corpus | sw | 12,660,806 |
| hausa_translation | ha | 5,861,080 |
| yoruba_synth | yo | 20,156 |
| cc100_yoruba | yo | 76,533 |

## 🚀 Usage

```python
from datasets import load_dataset

# Load the single "train" split as a Dataset. Without split="train" this
# returns a DatasetDict, which has no train_test_split method.
dataset = load_dataset("NaolBM/african-corpus", split="train")

# Access by language
amharic = dataset.filter(lambda x: x['language'] == 'am')
swahili = dataset.filter(lambda x: x['language'] == 'sw')

# Train/val split
dataset = dataset.train_test_split(test_size=0.01, seed=42)
train_data = dataset['train']
```
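
The metadata lists a download size of about 5 GB, so it can be useful to inspect the language mix before fetching everything. The following is a minimal sketch, not part of the original card: `count_languages`, `language_shares`, and `stream_corpus` are hypothetical helper names, the sample rows are made up for illustration, and the streaming call assumes the `datasets` library is installed and the Hub is reachable. The counts in `table_counts` are copied from the Language Distribution table.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Mapping, Optional


def count_languages(records: Iterable[Mapping[str, str]],
                    limit: Optional[int] = None) -> Counter:
    """Tally the `language` field over at most `limit` records."""
    return Counter(row["language"] for row in islice(records, limit))


def language_shares(counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert raw counts to percentages rounded to two decimals."""
    total = sum(counts.values())
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}


def stream_corpus():
    """Open the train split lazily (assumes `datasets` and network access)."""
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset
    return load_dataset("NaolBM/african-corpus", split="train", streaming=True)


# Made-up rows that mimic the `text` / `language` schema (illustration only).
sample = [
    {"text": "Habari ya asubuhi", "language": "sw"},
    {"text": "Karibu sana", "language": "sw"},
    {"text": "Sannu da zuwa", "language": "ha"},
    {"text": "Once upon a time", "language": "en"},
]
sample_counts = count_languages(sample)

# Counts copied from the Language Distribution table.
table_counts = {
    "sw": 14_127_076, "am": 10_815_255, "ha": 7_180_569, "en": 2_119_719,
    "om": 883_420, "yo": 279_656, "ti": 12_076,
}
shares = language_shares(table_counts)

print(sample_counts)
print(shares)
```

To sample the real corpus instead of the toy rows, something like `count_languages(stream_corpus(), limit=10_000)` should work. The first rows of the stream may not be representative of the whole dataset if the shards are grouped by source.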

Provider:

NaolBM
