
AndyOnyango/KenPOS

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/AndyOnyango/KenPOS
Resource description:
---
language:
- luo
- bxk
- lri
- rag
license: cc-by-4.0
task_categories:
- token-classification
tags:
- kenyan-languages
- dholuo
- lubukusu
- lumarachi
- lulogooli
- pos-tagging
- low-resource-languages
- african-languages
pretty_name: KenPOS
size_categories:
- 100K<n<1M
configs:
- config_name: dho
  data_files: "dho/*.parquet"
- config_name: lbk
  data_files: "lbk/*.parquet"
- config_name: lch
  data_files: "lch/*.parquet"
- config_name: llg
  data_files: "llg/*.parquet"
---

# KenPOS: Kenyan Languages Part-of-Speech Tagged Dataset

## Dataset Description

**KenPOS** is a part-of-speech (POS) tagged corpus for Kenyan languages, featuring **156,994 tokens** across four languages. The dataset provides manually annotated POS tags for low-resource Kenyan languages, enabling NLP research and applications.

## Dataset Statistics

| Language  | Code | Tokens      | Sentences | Files   | Unique POS Tags |
|-----------|------|-------------|-----------|---------|-----------------|
| Dholuo    | dho  | 54,712      | 70        | 168     | 114             |
| Lubukusu  | lbk  | 51,900      | 154       | 62      | 97              |
| Lumarachi | lch  | 25,917      | 27        | 212     | 78              |
| Lulogooli | llg  | 24,465      | 290       | 121     | 75              |
| **Total** |      | **156,994** | **541**   | **563** |                 |

## Languages & Codes

| Language / Dialect  | Code | Family / Notes          |
|---------------------|------|-------------------------|
| Dholuo (Luo)        | dho  | Nilotic (western Kenya) |
| Lubukusu (Bukusu)   | lbk  | Bantu, Luhya dialect    |
| Lumarachi (Marachi) | lch  | Bantu, Luhya dialect    |
| Lulogooli (Logooli) | llg  | Bantu, Luhya dialect    |

## Dataset Format

The dataset is distributed as **Parquet files** for optimal performance and compatibility:

- **Format**: Apache Parquet (columnar storage)
- **Encoding**: UTF-8
- **File naming**: `{language}/train.parquet`
- **Compatibility**: Works with `datasets` 4.0.0+ without custom loading scripts

---

## Data Fields

Each record in the dataset contains:

- **token**: `string` - The word or token
- **pos_tag**: `string` - Part-of-speech tag (e.g., NN, V, ADJ, PUNCT)
- **sentence_id**: `int` - Unique identifier for the sentence
- **position**: `int` - Position of the token within the sentence (0-indexed)
- **filename**: `string` - Source filename from which the token was extracted

### Example Record

```python
{
    'token': 'Kezia',
    'pos_tag': 'NN',
    'sentence_id': 0,
    'position': 0,
    'filename': '4411_dho_pos.csv'
}
```

---

## Usage

### Loading with 🤗 Datasets

**Compatible with datasets 4.0.0+** (No `trust_remote_code` needed!)

```python
from datasets import load_dataset

# Load Dholuo POS dataset
dho = load_dataset("Kencorpus/KenPOS", "dho")

# Load Lubukusu POS dataset
lbk = load_dataset("Kencorpus/KenPOS", "lbk")

# Load Lumarachi POS dataset
lch = load_dataset("Kencorpus/KenPOS", "lch")

# Load Lulogooli POS dataset
llg = load_dataset("Kencorpus/KenPOS", "llg")

# Access the data
print(dho['train'][0])
# Output: {'token': 'Kezia', 'pos_tag': 'NN', 'sentence_id': 0, 'position': 0, 'filename': '4411_dho_pos.csv'}
```

### Reconstructing Sentences

```python
from datasets import load_dataset
import pandas as pd

# Load dataset
dho = load_dataset("Kencorpus/KenPOS", "dho")
df = pd.DataFrame(dho['train'])

# Get first sentence
sentence_0 = df[df['sentence_id'] == 0].sort_values('position')
print(' '.join(sentence_0['token'].tolist()))
```

### Analyzing POS Tags

```python
from datasets import load_dataset
import pandas as pd

# Load dataset
dho = load_dataset("Kencorpus/KenPOS", "dho")
df = pd.DataFrame(dho['train'])

# Count POS tag frequencies
pos_counts = df['pos_tag'].value_counts()
print(pos_counts.head(10))
```

---

## POS Tag Categories

The dataset uses a variety of POS tags including:

- **NN** - Noun
- **V** - Verb
- **ADJ/Adj.** - Adjective
- **ADV/Adv** - Adverb
- **PRON** - Pronoun
- **ADP** - Adposition (preposition/postposition)
- **DET/Det.** - Determiner
- **CONJ/Conj.** - Conjunction
- **NUM** - Numeral
- **PUNCT/PUNC** - Punctuation
- And many more fine-grained categories

**Note**: Tag naming conventions may vary slightly across files (e.g., PUNCT vs PUNC, ADJ vs Adj.).

---

## Dataset Curators

- **Florence Indede** (Maseno University)
- **Owen McOnyango** (Maseno University)
- **Lilian D.A. Wanzare** (Maseno University)
- **Barack Wanjawa** (University of Nairobi)
- **Edward Ombui** (Africa Nazarene University)
- **Lawrence Muchemi** (University of Nairobi)

---

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{wanjawa2022kencorpus,
  title={Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks},
  author={Wanjawa, Barack W. and Wanzare, Lilian D. and Indede, Florence and McOnyango, Owen and Ombui, Edward and Muchemi, Lawrence},
  journal={arXiv preprint arXiv:2208.12081},
  year={2022}
}
```

---

## Links

- **Research Paper**: https://arxiv.org/abs/2208.12081
- **Dataverse**: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KLCKL5
- **ResearchGate**: https://www.researchgate.net/publication/371767223
- **Semantic Scholar**: https://www.semanticscholar.org/paper/8cf70c5cd8b195ed7a399ea2cdc0b0e8f08c61ce

---

## License

This dataset is licensed under **CC-BY-4.0**.

---

## Acknowledgments

This dataset is part of the **Kencorpus** project, which aims to create NLP resources for low-resource Kenyan languages.
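
---

## Appendix: Normalizing Tag Variants (Sketch)

The note under POS Tag Categories says that tag spellings can differ between files (e.g., PUNCT vs PUNC, ADJ vs Adj.), which inflates the unique-tag counts and splits frequencies across spellings. One way to handle this is to map each variant to a single canonical form before counting. The snippet below is a minimal sketch, not part of the dataset's tooling: the `TAG_ALIASES` mapping and the `normalize_tag` helper are hypothetical names, the mapping covers only the variants listed in this card, and the sample rows are toy data in the documented field layout rather than real corpus records (only the first row echoes the card's example record).

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical tag.
# It covers only the variants named in the dataset card; inspect the
# real tag inventory and extend it before relying on the result.
TAG_ALIASES = {
    "PUNC": "PUNCT",
    "Adj.": "ADJ",
    "Adv": "ADV",
    "Det.": "DET",
    "Conj.": "CONJ",
}


def normalize_tag(tag: str) -> str:
    """Return the canonical spelling of a tag; unknown tags pass through."""
    tag = tag.strip()
    return TAG_ALIASES.get(tag, tag)


# Toy rows in the documented field layout (placeholder tokens, not corpus data).
records = [
    {"token": "Kezia", "pos_tag": "NN", "sentence_id": 0, "position": 0},
    {"token": "tok_a", "pos_tag": "Adj.", "sentence_id": 0, "position": 1},
    {"token": ".", "pos_tag": "PUNC", "sentence_id": 0, "position": 2},
    {"token": ".", "pos_tag": "PUNCT", "sentence_id": 1, "position": 0},
]

# Count tags after normalization so variant spellings are merged.
counts = Counter(normalize_tag(r["pos_tag"]) for r in records)
print(counts)
```

On a loaded split, the same helper could be applied with `Dataset.map`, for example `dho["train"].map(lambda r: {"pos_tag": normalize_tag(r["pos_tag"])})`. Unknown tags are passed through unchanged on purpose, so fine-grained categories are never silently merged.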
Provider:
AndyOnyango