AndyOnyango/KenTrans

Name: AndyOnyango/KenTrans
Creator: AndyOnyango
Published: 2026-04-10 06:13:39
License: cc-by-4.0 (per the dataset card metadata)

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/AndyOnyango/KenTrans

Resource description:

---
language:
- luo
- bxk
- lri
- rag
- swa
license: cc-by-4.0
task_categories:
- translation
tags:
- kenyan-languages
- dholuo
- lubukusu
- lumarachi
- lulogooli
- swahili
- low-resource-languages
- african-languages
pretty_name: KenTrans
size_categories:
- 10K<n<100K
configs:
- config_name: dho
  data_files: "dho/*.parquet"
- config_name: lbk
  data_files: "lbk/*.parquet"
- config_name: lch
  data_files: "lch/*.parquet"
- config_name: llg
  data_files: "llg/*.parquet"
---

# KenTrans: Kenyan Languages to Swahili Translation Dataset

## Dataset Structure

**KenTrans** is a parallel corpus covering three Kenyan languages: Dholuo and Luhya (in three dialects) are the source languages, and every sentence is translated into **Swahili**. The dataset contains **11,795** sentence pairs:

- **Dholuo → Swahili:** 4,222 pairs
- **Luhya → Swahili (total):** 7,573 pairs across three dialects
  - **Lumarachi (lch):** 2,475 pairs
  - **Lulogooli (llg):** 3,692 pairs
  - **Lubukusu (lbk):** 1,406 pairs

The dataset is provided in **Parquet format**, which is compatible with the Hugging Face `datasets` library, version 4.0.0 and above.

Each example contains parallel text with the following fields:

- **source**: Original sentence in the source language
- **target**: Translation in Swahili

> Example
> ```python
> {
>     'source': 'OSIEPE MA KENDE',
>     'target': 'MARAFIKI WA DHATI'
> }
> ```

### Languages & Codes

| Language / Dialect  | Code | Family / Notes                |
|---------------------|------|-------------------------------|
| Dholuo (Luo)        | dho  | Nilotic (western Kenya)       |
| Lubukusu (Bukusu)   | lbk  | Bantu, Luhya dialect          |
| Lumarachi (Marachi) | lch  | Bantu, Luhya dialect          |
| Lulogooli (Logooli) | llg  | Bantu, Luhya dialect          |
| Swahili             | swa  | Target language for all pairs |

---

## Usage

### Loading with 🤗 Datasets

**Compatible with datasets 4.0.0+** (no `trust_remote_code` needed)

```python
from datasets import load_dataset

# Load Dholuo → Swahili
dho = load_dataset("Kencorpus/KenTrans", "dho")

# Load Lubukusu → Swahili
lbk = load_dataset("Kencorpus/KenTrans", "lbk")

# Load Lumarachi → Swahili
lch = load_dataset("Kencorpus/KenTrans", "lch")

# Load Lulogooli → Swahili
llg = load_dataset("Kencorpus/KenTrans", "llg")

# Access the data
print(dho['train'][0])
# Output: {'id': 'dho_dho_combined.txt_0', 'source': '6AM DALA FM NEWS...', 'target': 'VIDOKEZI VYA HABARI...', ...}
```

---

## Dataset Format

The dataset is distributed as **Parquet files** for performance and compatibility:

- **Format**: Apache Parquet (columnar storage)
- **Encoding**: UTF-8
- **File naming**: `{language}-train.parquet` (e.g., `dho-train.parquet`); each language pair has its own Parquet file
- **Rows**: Each row represents one parallel sentence pair
- **Metadata**: All metadata is included in the Parquet schema
- **Compatibility**: Works with `datasets` 4.0.0+ without custom loading scripts

---

## Data Fields (Programmatic View)

When parsed into records, each example can be represented with:

- `id`: Unique identifier (e.g., `dho_0773_dho_tr.txt_0`)
- `source`: Source language sentence
- `target`: Target Swahili sentence
- `src_lang`: One of `dho`, `lbk`, `lch`, `llg`
- `tgt_lang`: Always `swa`
- `pair`: Language pair as `"{src}-{tgt}"` (e.g., `dho-swa`)
- `filename`: Source filename (e.g., `0773_dho_tr.txt`)

---

## Sources

## Translators (Acknowledgements)

**Dholuo → Swahili**

- Mercy Lavinca Oduoll (Coordinator)
- Bildad Okebe
- Immaculate Ochieng
- Mary Muma

**Luhya (Logooli) → Swahili**

- Phillip Lumwamu (Coordinator)
- Kints Mugoha Musungu
- Vivian Alivitsa
- Joseph Ambwere
- Joyline Ingasiani

**Luhya (Bukusu) → Swahili**

- Martin Barasa Mulwale (Coordinator)
- Samwel Ralph Nyongesa
- Tobias Shikuku
- Phelisters N Simiyu

**Luhya (Marachi) → Swahili**

- Judith Awinja (Coordinator)
- Evans Owino
- Belinda Oduor
- Frankline Mwaro

## Dataset Curator

- Indede, Florence (Maseno University)
- McOnyango, Owen (Maseno University)
- Wanzare, Lilian D.A. (Maseno University)
- Wanjawa, Barack (University of Nairobi)
- Ombui, Edward (Africa Nazarene University)
- Muchemi, Lawrence (University of Nairobi)

## Research Paper

- Wanjawa, B.W., Wanzare, L.D., Indede, F., McOnyango, O., Ombui, E., & Muchemi, L. (2022). Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks. ArXiv, abs/2208.12081. https://arxiv.org/abs/2208.12081

## Links

Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KLCKL5
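
The pair counts and the Data Fields description given earlier in this card can be checked with a few lines of plain Python. The following is a minimal sketch, not part of the dataset's tooling: `STATED_PAIRS`, `REQUIRED_FIELDS`, and `validate_record` are hypothetical names introduced here, the counts are copied from the figures stated in the card, and the sample record is assembled from example values quoted in the card rather than read from the actual files.

```python
# Pair counts per config, copied from the figures stated in this card.
STATED_PAIRS = {"dho": 4222, "lbk": 1406, "lch": 2475, "llg": 3692}

# Field names as listed in the "Data Fields" section.
REQUIRED_FIELDS = ("id", "source", "target", "src_lang", "tgt_lang", "pair", "filename")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    if record["src_lang"] not in STATED_PAIRS:
        problems.append(f"unexpected src_lang: {record['src_lang']}")
    if record["tgt_lang"] != "swa":
        problems.append(f"unexpected tgt_lang: {record['tgt_lang']}")
    if record["pair"] != f"{record['src_lang']}-{record['tgt_lang']}":
        problems.append(f"pair does not match languages: {record['pair']}")
    return problems


# The three Luhya dialect configs should add up to the stated Luhya total.
luhya_total = sum(STATED_PAIRS[code] for code in ("lbk", "lch", "llg"))
grand_total = sum(STATED_PAIRS.values())

# Hypothetical record assembled from example values quoted in this card.
sample = {
    "id": "dho_0773_dho_tr.txt_0",
    "source": "OSIEPE MA KENDE",
    "target": "MARAFIKI WA DHATI",
    "src_lang": "dho",
    "tgt_lang": "swa",
    "pair": "dho-swa",
    "filename": "0773_dho_tr.txt",
}

print(luhya_total, grand_total, validate_record(sample))
```

To apply the same check to real data, `validate_record` could be run over the rows of a loaded split such as `dho['train']`, assuming the rows carry the fields listed under Data Fields.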

Provider:

AndyOnyango
