omanyasa/zim-langid-v2
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/omanyasa/zim-langid-v2
Resource description:
---
language:
- en
- sn
- ny
- nd
- to
license: mit
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
task_categories:
- text-classification
task_ids:
- language-identification
pretty_name: Zimbabwe Multilingual Language Identification Dataset
tags:
- languages
- zimbabwe
- africa
- language-identification
- english
- shona
- chewa
- ndebele
- tonga
- low-resource
- multilingual
---
# Dataset Card for Zimbabwe Multilingual Language Identification Dataset
## Dataset Description
This dataset contains 164,595 labeled text samples in five languages spoken in Zimbabwe (English, Shona, Chewa, Ndebele, and Tonga) for training and evaluating language identification models. It is intended to support natural language processing and machine learning research on African languages.
## Languages
- **English (en)**: International language with widespread usage
- **Shona (sn)**: Major Zimbabwean language with ~7 million speakers
- **Chewa (ny)**: Bantu language spoken in Zimbabwe and neighboring countries (ISO 639-1: ny)
- **Ndebele (nd)**: Major Zimbabwean language with ~2 million speakers (ISO 639-1: nd)
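- **Tonga (to)**: Bantu language spoken in Zimbabwe and Zambia. The dataset uses the label `to`; in ISO 639-1 that code is assigned to Tongan (Tonga Islands), while Tonga (Zambia) is `toi` in ISO 639-3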
## Dataset Statistics
- **Total Samples**: 164,595
- **Training Split**: 131,676 samples
- **Validation Split**: 16,459 samples
- **Test Split**: 16,460 samples
- **Average Length**: Varies by language and text source
- **Format**: FastText format with `__label__<code>` prefixes, plus Hugging Face records with `text` and `label` fields
## Class Distribution
| Language | Code | Training | Validation | Test | Total |
|----------|------|----------|------------|------|-------|
| English | en | 26,357 | 3,222 | 3,340 | 32,919 |
| Shona | sn | 26,421 | 3,248 | 3,250 | 32,919 |
| Chewa | ny | 26,365 | 3,285 | 3,269 | 32,919 |
| Ndebele | nd | 26,326 | 3,344 | 3,249 | 32,919 |
| Tonga | to | 26,207 | 3,360 | 3,352 | 32,919 |
| **Total** | | **131,676** | **16,459** | **16,460** | **164,595** |
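
The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the per-language counts from the table (the variable names are illustrative and not part of the dataset) and recomputes the row totals, the column totals, and each split's share of the data.

```python
# Per-language counts copied from the class distribution table:
# (training, validation, test)
counts = {
    "en": (26_357, 3_222, 3_340),
    "sn": (26_421, 3_248, 3_250),
    "ny": (26_365, 3_285, 3_269),
    "nd": (26_326, 3_344, 3_249),
    "to": (26_207, 3_360, 3_352),
}

# Column totals: sum each split across the five languages.
split_totals = [sum(column) for column in zip(*counts.values())]

# Row totals: sum the three splits for each language.
language_totals = {code: sum(row) for code, row in counts.items()}

grand_total = sum(split_totals)

# Share of the full dataset held by each split, rounded to three decimals.
split_shares = [round(n / grand_total, 3) for n in split_totals]

print("Split totals:", split_totals)
print("Language totals:", language_totals)
print("Split shares:", split_shares)
```

Every language sums to 32,919 samples, and the splits work out to roughly 80% / 10% / 10% of the 164,595 total.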
## Data Format
### FastText Format
```
__label__en This is English text
__label__sn Zvimhu zita rake shona
__label__ny Muli bwanji chichewa
__label__nd Salingelele isindebele
__label__to Mwapona wa ci Tonga
```
### Hugging Face Format
```json
{"text": "This is English text", "label": "en"}
{"text": "Zvimhu zita rake shona", "label": "sn"}
{"text": "Muli bwanji chichewa", "label": "ny"}
{"text": "Salingelele isindebele", "label": "nd"}
{"text": "Mwapona wa ci Tonga", "label": "to"}
```
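
The two formats carry the same information, so converting between them is mechanical. A minimal sketch follows, assuming each FastText line holds exactly one `__label__` token, then a space, then the text on a single line. The helper names `parse_fasttext_line` and `to_fasttext_line` are hypothetical and are not shipped with the dataset.

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split one FastText-format line into a {"text", "label"} record."""
    label_token, _, text = line.strip().partition(" ")
    if not label_token.startswith(prefix):
        raise ValueError(f"missing {prefix!r} prefix: {line!r}")
    return {"text": text.strip(), "label": label_token[len(prefix):]}


def to_fasttext_line(record, prefix="__label__"):
    """Inverse conversion: render a record as one FastText-format line."""
    return f"{prefix}{record['label']} {record['text']}"


# Two of the sample lines shown above.
lines = [
    "__label__en This is English text",
    "__label__to Mwapona wa ci Tonga",
]

records = [parse_fasttext_line(line) for line in lines]
print(records)

# Round trip back to FastText format.
print([to_fasttext_line(record) for record in records])
```

Text containing newlines would need to be flattened before writing FastText lines, since that format expects one sample per line.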
## Data Sources
- **Public domain religious texts**: Open-access religious and educational materials
- **Educational materials**: School textbooks and learning resources
- **Open-access textual corpora**: Public domain literary and news content
- **Community-contributed samples**: Native speaker contributions and verified content
## Data Quality
- **Label Accuracy**: Manually verified by native speakers
- **Text Cleaning**: Standardized preprocessing applied
- **Dialect Balance**: Representative sampling across major dialects
- **Quality Control**: Automated validation and manual review
## Intended Uses
- **Language Identification**: Training classification models
- **Linguistic Research**: Studying language patterns and features
- **Educational**: Teaching NLP concepts with African languages
- **Cross-lingual Applications**: Building multilingual systems
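
To illustrate the language identification use case, here is a small baseline sketch: a multinomial Naive Bayes classifier over character trigrams, written with the standard library only. It is not the maintainers' method, and the class name `NgramNaiveBayes` is hypothetical. The five training sentences are the sample lines from this card, so the result only demonstrates the mechanics and says nothing about accuracy on the real data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of samples
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Add-one smoothing: unseen n-grams get a small nonzero probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: the sample sentences shown in this card.
texts = [
    "This is English text",
    "Zvimhu zita rake shona",
    "Muli bwanji chichewa",
    "Salingelele isindebele",
    "Mwapona wa ci Tonga",
]
labels = ["en", "sn", "ny", "nd", "to"]

model = NgramNaiveBayes(n=3).fit(texts, labels)
print(model.predict("Muli bwanji"))
```

On the real dataset, the `text` and `label` columns of the training split would be passed to `fit` in the same way, and the validation split could be used to choose the n-gram size.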
## Limitations
- **Domain Specific**: Primarily formal text, limited informal language
- **Dialect Coverage**: May not cover all regional variations
- **Imbalance**: Languages are balanced overall (32,919 samples each), but per-split counts differ slightly between languages
- **Code-switching**: Limited examples of mixed-language text
## Ethical Considerations
- **Representation**: Efforts made to balance language representation
- **Cultural Sensitivity**: Text reviewed for cultural appropriateness
- **Data Privacy**: Personal information removed from text samples
- **Community Involvement**: Native speakers consulted in validation
## Usage Example
```python
from collections import Counter

from datasets import load_dataset

# Load all splits of the dataset from the Hugging Face Hub
dataset = load_dataset("omanyasa/zim-langid-v2")

# Access training data
train_data = dataset["train"]
print(f"Training samples: {len(train_data)}")

# View a sample
sample = train_data[0]
print(f"Text: {sample['text']}")
print(f"Language: {sample['label']}")

# Get label distribution
label_dist = Counter(train_data["label"])
print("Label distribution:", dict(label_dist))
```
## Research Impact
This dataset contributes to low-resource African NLP by:
- Supporting language identification for underrepresented languages
- Enabling multilingual model development for Zimbabwean languages
- Improving inclusivity in NLP systems for African languages
- Providing benchmark data for cross-lingual transfer learning
- Facilitating research on low-resource language processing
## Future Work
- **Expansion**: Scale to 16 Zimbabwean languages
- **Code-switching**: Include mixed-language text samples
- **Multimodal**: Add speech + text alignment data
- **LLM Integration**: Fine-tune large language models for Zimbabwean languages
- **Dialect Coverage**: Include regional variations and dialects
## Maintenance
- **Updates**: Planned quarterly with new text samples
- **Version Control**: Semantic versioning for dataset updates
- **Community Feedback**: Open to contributions and corrections
- **Quality Assurance**: Regular validation and cleaning processes
## Citation
If you use this dataset, please cite:
```
@dataset{zim_langid_v2,
title={Zimbabwe Multilingual Language Identification Dataset},
author={MSU National Language Institute (MSUNLI)},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/omanyasa/zim-langid-v2}
}
```
## License
This dataset is licensed under the MIT License, which permits commercial and non-commercial use provided the copyright and license notice are retained.
## Contact
For questions, contributions, or feedback regarding this dataset:
- **Repository**: https://huggingface.co/datasets/omanyasa/zim-langid-v2
- **Organization**: MSU National Language Institute (MSUNLI)
- **Issues**: Use GitHub issues or Hugging Face discussions
Provider:
omanyasa



