silvanosolutions/xhosa-nlp-dataset

Name: silvanosolutions/xhosa-nlp-dataset
Creator: silvanosolutions
Published: 2026-04-06 10:27:11
License: No description available

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/silvanosolutions/xhosa-nlp-dataset

Description:

---
language:
- xh
- en
license: other
multilinguality: translation
size_categories: 100K<n<1M
source_datasets:
- opus_books
- wikipedia
- glot500
- masakhane/mafand
task_categories:
- translation
- text-generation
- token-classification
tags:
- xhosa
- isixhosa
- south-africa
- african-languages
- low-resource
- nlp
- parallel-corpus
pretty_name: isiXhosa NLP Dataset
dataset_info:
  version: 1.0.0
configs:
- config_name: monolingual
  data_files:
  - split: train
    path: monolingual.parquet
  - split: validation
    path: splits/validation.parquet
  - split: test
    path: splits/test.parquet
- config_name: parallel
  data_files:
  - split: train
    path: parallel.parquet
  - split: validation
    path: splits/validation.parquet
  - split: test
    path: splits/test.parquet
---

# 🇿🇦 Xhosa NLP Dataset

[![Dataset Size](https://img.shields.io/badge/Dataset_Size-155k_Records-blue)]()
[![Language](https://img.shields.io/badge/Language-isiXhosa_%28xh%29-green)]()
[![License](https://img.shields.io/badge/License-Varies-lightgrey)](#data-sources-and-licenses)
[![Hugging Face](https://img.shields.io/badge/🤗_Hugging_Face-View_Dataset-ffcc00)](https://huggingface.co/datasets/silvanosolutions/xhosa-nlp-dataset)

An isiXhosa (Xhosa) NLP training dataset, collected, cleaned, and packaged for training AI language models, translation systems, and other natural language processing applications.

## 📖 Introduction

isiXhosa is one of South Africa's 12 official languages, spoken by millions of people. However, like many African languages, it remains severely underrepresented in AI and machine learning training data.

**xhosa-nlp-dataset** is built to bridge this gap. This project aggregates text data from various sources, ranging from government documents and news articles to encyclopedic knowledge and general web text, and processes it into standard, machine-learning-ready formats.

Whether you are a machine learning researcher training an LLM, a startup building localized South African products, or a developer extending translation systems, this dataset provides a foundational resource for Xhosa NLP.

## 📊 Dataset Statistics

The dataset contains a total of **155,380 records** (sentences or sentence pairs), broadly categorized into monolingual (Xhosa-only) and parallel (Xhosa ↔ English) subsets.

The following table shows raw records collected per source before cleaning and deduplication.

| Source | Type | Raw Records | Domains |
| :--- | :--- | :--- | :--- |
| **OPUS-100 EN↔XH** | Parallel | 267,920 | General web text |
| **CC-100 / Glot500** | Monolingual | 50,000 | General web text |
| **Autshumato SA Gov** | Parallel | 44,442 | Government and legal |
| **Wikipedia isiXhosa** | Monolingual | 17,997 | Encyclopedic knowledge |
| **MasakhaNews** | Monolingual | 2,305 | News articles |
| **XhosaNavy (Stellenbosch University)** ¹ | Parallel | 44,442 | Government and legal |
| **Total** | | **382,664** | |

*The 155,380 figure is the record count after cleaning and deduplication. Raw collected records: 382,664.*

¹ *License pending verification; see [Data Sources and Licenses](#data-sources-and-licenses) for full details.*

**Breakdown by Type:**

* **Monolingual Data:** 44,699 records
* **Parallel Data:** 110,681 records

### Splits

| Split      | Records | Monolingual | Parallel |
|------------|---------|-------------|----------|
| Train      | 124,303 | 35,759      | 88,544   |
| Validation | 15,537  | 4,469       | 11,068   |
| Test       | 15,540  | 4,471       | 11,069   |

## 🧩 Data Formats

The dataset is packaged in JSON Lines (JSONL) format, with two record types.

### Monolingual Records

Xhosa-only texts, primarily designed for pre-training and self-supervised learning.

```json
{
  "id": "wiki_42_3",
  "text": "Umntu ngumntu ngabantu.",
  "source": "wikipedia_xh",
  "type": "monolingual",
  "domain": "general",
  "license": "CC-BY-SA"
}
```

### Parallel Records

Aligned Xhosa and English sentence pairs, ideal for translation models and cross-lingual transfer learning.

```json
{
  "id": "opus_12345",
  "xhosa": "Umntu ngumntu ngabantu.",
  "english": "A person is a person through other people.",
  "source": "opus100",
  "type": "parallel",
  "domain": "general",
  "license": "CC-BY"
}
```

## 🚀 Installation & Usage

You can load and use this dataset directly via the [Hugging Face `datasets`](https://huggingface.co/docs/datasets/) library in Python.

First, ensure you have the required library installed:

```bash
pip install datasets
```

Then, load the dataset in your Python environment:

```python
from datasets import load_dataset

# Load the monolingual dataset split
monolingual_ds = load_dataset("silvanosolutions/xhosa-nlp-dataset", "monolingual", split="train")
print(monolingual_ds[0])

# Load the parallel dataset split
parallel_ds = load_dataset("silvanosolutions/xhosa-nlp-dataset", "parallel", split="train")
print(parallel_ds[0])
```

## 🏗️ Project Structure

The underlying pipeline (Python 3.13) that collects, cleans, and builds this dataset is organized as follows:

```text
.
├── scrapers/
│   ├── __init__.py
│   ├── utils.py                # Shared utilities: DIRS, HEADERS, log_section
│   ├── oscar_scraper.py        # SOURCE 1: CC-100/Glot500 monolingual
│   ├── mafand_scraper.py       # SOURCE 2: OPUS-100 parallel
│   ├── autshumato_scraper.py   # SOURCE 3: Autshumato government parallel
│   ├── wikipedia_scrapper.py   # SOURCE 4: isiXhosa Wikipedia
│   ├── government_scrapper.py  # SOURCE 5: MasakhaNews
│   └── run_all.py              # Entry point to run all scrapers
├── cleaning/
│   ├── __init__.py
│   ├── utils.py
│   ├── language_verifier.py
│   ├── deduplicator.py
│   ├── domain_tagger.py
│   ├── packager.py
│   └── run_cleaning.py
└── data/
    ├── raw/      # Raw collected data per source
    ├── cleaned/  # Deduplicated and quality-filtered data
    └── final/    # Packaged dataset ready for HF upload
```

Libraries used in processing: `datasets`, `beautifulsoup4`, `pandas`, `langdetect`, `ftfy`, `requests`.

## ⚖️ Data Sources and Licenses

Because this dataset is an aggregation of several underlying corpora, the respective data points maintain their original licenses.

| Dataset Source | Original License |
| :--- | :--- |
| **Glot500** | Apache-2.0 |
| **OPUS-100** | CC-BY |
| **Autshumato** | CC-BY |
| **Wikipedia** | CC-BY-SA |
| **MasakhaNews** | CC-BY |
| **XhosaNavy (Stellenbosch University)** | ⚠️ Pending verification |

> ⚠️ **XhosaNavy License Notice**
> The XhosaNavy corpus was sourced from [OPUS](https://opus.nlpl.eu/datasets/XhosaNavy) and originated from research at Stellenbosch University (Herman Engelbrecht, Dept. of E&E Engineering). OPUS explicitly states it does not own the source text and cannot guarantee redistribution rights. License confirmation for commercial redistribution is currently being sought from the original author.
> **Commercial users should contact the maintainer before using records where `source == "xhosanavey"`.**

## 🎯 Intended Use Cases

This dataset is designed specifically for:

1. **Language Modeling:** Training or continuing pre-training of Xhosa language models.
2. **Multilingual LLMs:** Fine-tuning multilingual models (e.g., AfroXLMR, AfriBERTa) to improve Xhosa comprehension.
3. **Machine Translation:** Building high-fidelity Xhosa-English and English-Xhosa translation systems.
4. **Sentiment Analysis:** Training commercial sentiment classifiers and customer feedback analyzers in Xhosa.
5. **Named Entity Recognition:** Teaching systems to correctly identify entities in Xhosa text.
6. **Commercial African Tech:** Providing training data for products targeting Xhosa speakers in the South African and broader African markets.

## 🤝 How to Contribute

We welcome contributions from researchers and developers! If you have scripts for scraping additional Xhosa data, notice data quality issues, or want to contribute new parallel sets, please:

1. Fork the repository.
2. Set up your Python 3.13 environment.
3. Add or update scraper files inside the `/scrapers` directory using existing shared utilities.
4. Ensure text encoding fixes and language verification steps are applied.
5. Open a Pull Request detailing your additions or fixes.

## 📝 Citation

If you use this dataset in a research publication or project, please cite it using the following format:

```bibtex
@dataset{xhosa_nlp_dataset_2026,
  author       = {Ntsika Silvano},
  title        = {Xhosa NLP Dataset: A Comprehensive IsiXhosa Text Corpus},
  year         = {2026},
  version      = {1.0.0},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/silvanosolutions/xhosa-nlp-dataset}},
}
```

*(Please also cite the original source datasets where applicable: OPUS-100, MasakhaNews, Autshumato, Wikipedia, and Glot500.)*

## 📜 License

### Code License

The aggregation and compilation code (`/scrapers`), utilities, and pipelines in this repository are licensed under the **MIT License**.

### Data Licenses

The underlying processed data retains the original license of its respective source, as stated in the [Data Sources and Licenses](#data-sources-and-licenses) section.
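
Because each record carries its own `source` and `license` fields, per-record license filtering can be sketched as below. This is a minimal illustration, not part of the project's pipeline: the allow-list, the helper names `has_verified_license` and `keep_verified`, and the `"pending"` license string on the third sample record are all assumptions made for the example.

```python
from typing import Iterable

# Hypothetical allow-list built from the license strings in the
# "Data Sources and Licenses" table; adjust it to your own requirements.
ALLOWED_LICENSES = {"Apache-2.0", "CC-BY", "CC-BY-SA"}


def has_verified_license(record: dict) -> bool:
    """Return True when the record's `license` field is on the allow-list."""
    return record.get("license") in ALLOWED_LICENSES


def keep_verified(records: Iterable[dict]) -> list[dict]:
    """Keep only the records whose license is on the allow-list."""
    return [r for r in records if has_verified_license(r)]


sample = [
    {"id": "wiki_42_3", "source": "wikipedia_xh", "license": "CC-BY-SA"},
    {"id": "opus_12345", "source": "opus100", "license": "CC-BY"},
    # Hypothetical record: the id and the "pending" license value are assumed.
    {"id": "navy_1", "source": "xhosanavey", "license": "pending"},
]

verified = keep_verified(sample)
print([r["id"] for r in verified])
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `parallel_ds.filter(has_verified_license)`, assuming the loaded records expose the `license` field shown in the Data Formats section.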

### Commercial Licensing

For commercial dataset access or alternative licensing agreements, please refer to the Contact section below.

## 📬 Contact & Maintainers

For commercial licensing inquiries, usage questions, or partnership opportunities, please reach out.

* **Maintainer:** Ntsika Silvano
* **Issues:** Please open a GitHub issue if you spot data anomalies or bugs.
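
As a closing sanity check, the figures in the Splits table can be cross-checked against the totals reported under Dataset Statistics. The sketch below copies the numbers from this card; the helper name `check_splits` is hypothetical and not part of the project's code.

```python
# Figures copied from the Splits table: split -> (records, monolingual, parallel)
SPLITS = {
    "train": (124_303, 35_759, 88_544),
    "validation": (15_537, 4_469, 11_068),
    "test": (15_540, 4_471, 11_069),
}
# Totals copied from the Dataset Statistics section.
TOTAL, MONO_TOTAL, PARALLEL_TOTAL = 155_380, 44_699, 110_681


def check_splits(splits, total, mono_total, parallel_total):
    """Return a list of inconsistencies found in the table (empty if none)."""
    problems = []
    # Each row: monolingual + parallel must equal the row's record count.
    for name, (records, mono, parallel) in splits.items():
        if mono + parallel != records:
            problems.append(f"{name}: {mono} + {parallel} != {records}")
    # Each column: the three splits must sum to the reported total.
    column_sums = [sum(column) for column in zip(*splits.values())]
    labels = ("records", "monolingual", "parallel")
    expected = (total, mono_total, parallel_total)
    for label, got, want in zip(labels, column_sums, expected):
        if got != want:
            problems.append(f"{label}: splits sum to {got}, expected {want}")
    return problems


print(check_splits(SPLITS, TOTAL, MONO_TOTAL, PARALLEL_TOTAL))

# Share of all records in each split.
for name, (records, _, _) in SPLITS.items():
    print(f"{name}: {records / TOTAL:.1%}")
```

The same function can be reused after loading the actual splits by replacing the hard-coded tuples with counts computed from the data.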
