
NaolBM/african-corpus

Source: Hugging Face. Updated 2026-02-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NaolBM/african-corpus
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 10537058711
    num_examples: 35417771
  download_size: 5053335344
  dataset_size: 10537058711
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Comprehensive African Languages Dataset

## 📊 Dataset Overview

- **Total rows:** 35,417,771
- **Languages:** 6 African languages + English
- **Features:** `text` | `language`

## 🌍 Language Distribution

| Language | Count | Percentage | Distribution |
|----------|------------|------------|--------------------------------|
| sw | 14,127,076 | 39.89% | ███████████░░░░░░░░░░░░░░░░░░░ |
| am | 10,815,255 | 30.54% | █████████░░░░░░░░░░░░░░░░░░░░░ |
| ha | 7,180,569 | 20.27% | ██████░░░░░░░░░░░░░░░░░░░░░░░░ |
| en | 2,119,719 | 5.98% | █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| om | 883,420 | 2.49% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| yo | 279,656 | 0.79% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| ti | 12,076 | 0.03% | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ |
| **TOTAL** | **35,417,771** | **100.00%** | |

---

## 📝 Dataset Sources

| Source | Language | Rows |
|--------|----------|------|
| amharic-combined | am | 10,056,352 |
| bible | am | 30,752 |
| wikipedia | am | 13,906 |
| amharic_corpus | am | 707,649 |
| tinystories | en | 2,119,719 |
| afaanoromoo | om | 410,841 |
| oromo_wiki | om | 1,970 |
| tigrinya | ti | 12,076 |
| amharic_books | am | 6,596 |
| oromo_name | om | 28,426 |
| hausa | ha | 1,282,997 |
| swahili | sw | 1,442,912 |
| yoruba | yo | 149,148 |
| swahili_news | sw | 22,207 |
| swahili_corpus | sw | 12,660,806 |
| hausa_translation | ha | 5,861,080 |
| yoruba_synth | yo | 20,156 |
| cc100_yoruba | yo | 76,533 |

## 🚀 Usage

```python
from datasets import load_dataset

# Load the train split (the only split this dataset defines).
# Requesting split="train" returns a Dataset, which supports
# train_test_split(); the DatasetDict returned without it does not.
dataset = load_dataset("NaolBM/african-corpus", split="train")

# Access by language
amharic = dataset.filter(lambda x: x['language'] == 'am')
swahili = dataset.filter(lambda x: x['language'] == 'sw')

# Train/val split
dataset = dataset.train_test_split(test_size=0.01, seed=42)
train_data = dataset['train']
```
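The per-language counts and percentages reported in the distribution table can be recomputed from the `language` column. The following is a minimal sketch, not the author's method: `language_distribution` is a hypothetical helper, and the `sample` rows are made-up stand-ins so the snippet runs without downloading the corpus.

```python
from collections import Counter
from typing import Iterable, Mapping


def language_distribution(
    rows: Iterable[Mapping[str, str]],
) -> dict[str, tuple[int, float]]:
    """Return {language: (count, percentage)}, ordered by descending count.

    Hypothetical helper: it only assumes each row has a 'language' key,
    as the dataset's feature list states.
    """
    counts = Counter(row["language"] for row in rows)
    total = sum(counts.values())
    return {
        lang: (n, round(100 * n / total, 2))
        for lang, n in counts.most_common()
    }


# Made-up sample rows standing in for the real corpus (assumption).
sample = [
    {"text": "Habari ya asubuhi", "language": "sw"},
    {"text": "Habari za jioni", "language": "sw"},
    {"text": "ሰላም", "language": "am"},
    {"text": "Sannu", "language": "ha"},
]

distribution = language_distribution(sample)
print(distribution)
```

Because the helper accepts any iterable of rows, it should also work on the real data loaded with `load_dataset("NaolBM/african-corpus", split="train", streaming=True)`, which iterates lazily instead of downloading the full corpus first. A full pass over roughly 35 million rows will still take a while.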
Provider:
NaolBM