
omanyasa/zim-langid-v2

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/omanyasa/zim-langid-v2
Resource description:
---
language:
- en
- sn
- ny
- nd
- to
license: mit
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
task_categories:
- text-classification
task_ids:
- language-identification
pretty_name: Zimbabwe Multilingual Language Identification Dataset
tags:
- languages
- zimbabwe
- africa
- language-identification
- english
- shona
- chewa
- ndebele
- tonga
- low-resource
- multilingual
---

# Dataset Card for Zimbabwe Multilingual Language Identification Dataset

## Dataset Description

This dataset contains text samples in five languages used in Zimbabwe, labelled for language identification. It is designed to support natural language processing and machine learning research on African languages.

## Languages

- **English (en)**: International language with widespread usage
- **Shona (sn)**: Major Zimbabwean language with ~7 million speakers
- **Chewa (ny)**: Bantu language spoken in Zimbabwe and neighboring countries (ISO 639-1: ny)
- **Ndebele (nd)**: Major Zimbabwean language with ~2 million speakers (ISO 639-1: nd)
- **Tonga (to)**: Bantu language spoken in Zimbabwe and Zambia (labelled `to` in this dataset; note that ISO 639-1 assigns `to` to Tongan of the Tonga Islands, while Zambezi Tonga is `toi` in ISO 639-3)

## Dataset Statistics

- **Total Samples**: 164,595
- **Training Split**: 131,676 samples
- **Validation Split**: 16,459 samples
- **Test Split**: 16,460 samples
- **Average Length**: Varies by language and text source
- **Format**: FastText format with standardized labels

## Class Distribution

| Language | Code | Training | Validation | Test | Total |
|----------|------|----------|------------|------|-------|
| English | en | 26,357 | 3,222 | 3,340 | 32,919 |
| Shona | sn | 26,421 | 3,248 | 3,250 | 32,919 |
| Chewa | ny | 26,365 | 3,285 | 3,269 | 32,919 |
| Ndebele | nd | 26,326 | 3,344 | 3,249 | 32,919 |
| Tonga | to | 26,207 | 3,360 | 3,352 | 32,919 |
| **Total** | | **131,676** | **16,459** | **16,460** | **164,595** |

## Data Format

### FastText Format

```
__label__en This is English text
__label__sn Zvimhu zita rake shona
__label__ny Muli bwanji chichewa
__label__nd Salingelele isindebele
__label__to Mwapona wa ci Tonga
```

### Hugging Face Format

```json
{"text": "This is English text", "label": "en"}
{"text": "Zvimhu zita rake shona", "label": "sn"}
{"text": "Muli bwanji chichewa", "label": "ny"}
{"text": "Salingelele isindebele", "label": "nd"}
{"text": "Mwapona wa ci Tonga", "label": "to"}
```

## Data Sources

- **Public domain religious texts**: Open-access religious and educational materials
- **Educational materials**: School textbooks and learning resources
- **Open-access textual corpora**: Public domain literary and news content
- **Community-contributed samples**: Native speaker contributions and verified content

## Data Quality

- **Label Accuracy**: Manually verified by native speakers
- **Text Cleaning**: Standardized preprocessing applied
- **Dialect Balance**: Representative sampling across major dialects
- **Quality Control**: Automated validation and manual review

## Intended Uses

- **Language Identification**: Training classification models
- **Linguistic Research**: Studying language patterns and features
- **Educational**: Teaching NLP concepts with African languages
- **Cross-lingual Applications**: Building multilingual systems

## Limitations

- **Domain Specific**: Primarily formal text, limited informal language
- **Dialect Coverage**: May not cover all regional variations
- **Imbalance**: Per-language totals are equal (32,919 each), but per-split counts differ slightly between languages
- **Code-switching**: Limited examples of mixed-language text

## Ethical Considerations

- **Representation**: Efforts made to balance language representation
- **Cultural Sensitivity**: Text reviewed for cultural appropriateness
- **Data Privacy**: Personal information removed from text samples
- **Community Involvement**: Native speakers consulted in validation

## Usage Example

```python
from collections import Counter

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("omanyasa/zim-langid-v2")

# Access training data
train_data = dataset["train"]
print(f"Training samples: {len(train_data)}")

# View a sample
sample = train_data[0]
print(f"Text: {sample['text']}")
print(f"Language: {sample['label']}")

# Get label distribution
label_dist = Counter(train_data["label"])
print("Label distribution:", dict(label_dist))
```

## Research Impact

This dataset contributes to low-resource African NLP by:

- Supporting language identification for underrepresented languages
- Enabling multilingual model development for Zimbabwean languages
- Improving inclusivity in NLP systems for African languages
- Providing benchmark data for cross-lingual transfer learning
- Facilitating research on low-resource language processing

## Future Work

- **Expansion**: Scale to 16 Zimbabwean languages
- **Code-switching**: Include mixed-language text samples
- **Multimodal**: Add speech + text alignment data
- **LLM Integration**: Fine-tune large language models for Zimbabwean languages
- **Dialect Coverage**: Include regional variations and dialects

## Maintenance

- **Updates**: Planned quarterly with new text samples
- **Version Control**: Semantic versioning for dataset updates
- **Community Feedback**: Open to contributions and corrections
- **Quality Assurance**: Regular validation and cleaning processes

## Citation

If you use this dataset, please cite:

```
@dataset{zim_langid_v2,
  title={Zimbabwe Multilingual Language Identification Dataset},
  author={MSU National Language Institute (MSUNLI)},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/omanyasa/zim-langid-v2}
}
```

## License

This dataset is licensed under the MIT License, allowing for commercial and non-commercial use with proper attribution.
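
## Example: Converting Between Formats

The Data Format section lists the same samples in fastText and Hugging Face (JSON) form. The sketch below shows one way to convert between the two using only the standard library. The helper names `fasttext_to_record` and `record_to_fasttext` are hypothetical and not part of the dataset or of any library; the sketch also assumes each fastText line carries exactly one `__label__` prefix followed by the text.

```python
import json

LABEL_PREFIX = "__label__"


def fasttext_to_record(line: str) -> dict:
    """Parse one '__label__xx some text' line into a {'text', 'label'} record."""
    label_token, _, text = line.strip().partition(" ")
    if not label_token.startswith(LABEL_PREFIX):
        raise ValueError(f"missing {LABEL_PREFIX} prefix: {line!r}")
    return {"text": text.strip(), "label": label_token[len(LABEL_PREFIX):]}


def record_to_fasttext(record: dict) -> str:
    """Render a record as a single fastText line."""
    # fastText reads one example per line, so collapse newlines and repeated spaces.
    text = " ".join(record["text"].split())
    return f"{LABEL_PREFIX}{record['label']} {text}"


# Sample lines copied from the Data Format section.
lines = [
    "__label__en This is English text",
    "__label__sn Zvimhu zita rake shona",
]

records = [fasttext_to_record(line) for line in lines]
for record in records:
    print(json.dumps(record, ensure_ascii=False))

# Converting back should reproduce the original lines.
assert [record_to_fasttext(r) for r in records] == lines
```

Collapsing whitespace in `record_to_fasttext` is a design choice for this sketch: it keeps every example on one line, at the cost of not preserving the original spacing of multi-line text.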
## Contact

For questions, contributions, or feedback regarding this dataset:

- **Repository**: https://huggingface.co/datasets/omanyasa/zim-langid-v2
- **Organization**: MSU National Language Institute (MSUNLI)
- **Issues**: Use GitHub issues or Hugging Face discussions
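
## Example: Checking Split Sizes

The figures in the Dataset Statistics and Class Distribution sections can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the per-language counts from the table (it does not download the dataset) and derives the split totals, the per-language totals, and the share of samples in each split. The variable names are illustrative only.

```python
# Per-language (train, validation, test) counts copied from the Class Distribution table.
counts = {
    "en": (26357, 3222, 3340),
    "sn": (26421, 3248, 3250),
    "ny": (26365, 3285, 3269),
    "nd": (26326, 3344, 3249),
    "to": (26207, 3360, 3352),
}
split_names = ("train", "validation", "test")

# Column sums: total samples in each split.
split_totals = [sum(row[i] for row in counts.values()) for i in range(len(split_names))]
grand_total = sum(split_totals)

# Row sums: total samples for each language.
language_totals = {code: sum(row) for code, row in counts.items()}

# Share of all samples that falls in each split.
fractions = {name: total / grand_total for name, total in zip(split_names, split_totals)}

print("Split totals:", dict(zip(split_names, split_totals)))
print("Grand total:", grand_total)
print("Per-language totals:", language_totals)
print("Split fractions:", {name: round(f, 4) for name, f in fractions.items()})
```

With the table's numbers, the split totals match the reported 131,676 / 16,459 / 16,460, every language sums to 32,919, and the training split holds 80% of the samples.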
Provider: omanyasa