silvanosolutions/xhosa-nlp-dataset
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/silvanosolutions/xhosa-nlp-dataset
---
language:
- xh
- en
license: other
multilinguality: translation
size_categories: 100K<n<1M
source_datasets:
- opus_books
- wikipedia
- glot500
- masakhane/mafand
task_categories:
- translation
- text-generation
- token-classification
tags:
- xhosa
- isixhosa
- south-africa
- african-languages
- low-resource
- nlp
- parallel-corpus
pretty_name: isiXhosa NLP Dataset
dataset_info:
  version: 1.0.0
configs:
- config_name: monolingual
  data_files:
  - split: train
    path: monolingual.parquet
  - split: validation
    path: splits/validation.parquet
  - split: test
    path: splits/test.parquet
- config_name: parallel
  data_files:
  - split: train
    path: parallel.parquet
  - split: validation
    path: splits/validation.parquet
  - split: test
    path: splits/test.parquet
---
# 🇿🇦 Xhosa NLP Dataset
This is an isiXhosa (Xhosa) text dataset of 155,380 cleaned and deduplicated records, collected and packaged for training AI language models, translation systems, and other natural language processing applications.
## 📖 Introduction
isiXhosa is one of South Africa's 12 official languages, spoken by millions of people. However, like many African languages, it remains severely underrepresented in AI and machine learning training data.
**xhosa-nlp-dataset** is built to bridge this gap. This project aggregates text from several sources, ranging from government documents and news articles to encyclopedic knowledge and general web text, and processes it into standard, machine-learning-ready formats.
Whether you are a machine learning researcher training the next big LLM, a startup building localized South African products, or a developer extending translation systems, this dataset provides a robust foundational resource for Xhosa NLP.
## 📊 Dataset Statistics
The dataset contains a total of **155,380 sentences**, broadly categorized into monolingual (Xhosa-only) and parallel (Xhosa ↔ English) subsets.
The following table shows raw records collected per source before cleaning and deduplication.
| Source | Type | Raw Records | Domains |
| :--- | :--- | :--- | :--- |
| **OPUS-100 EN↔XH** | Parallel | 267,920 | General web text |
| **CC-100 / Glot500** | Monolingual | 50,000 | General web text |
| **Autshumato SA Gov** | Parallel | 44,442 | Government and legal |
| **Wikipedia isiXhosa** | Monolingual | 17,997 | Encyclopedic knowledge |
| **MasakhaNews** | Monolingual | 2,305 | News articles |
| **XhosaNavy (Stellenbosch University)** ¹ | Parallel | 44,442 | Government and legal |
| **Total** | | **382,664** | |
*The final dataset contains 155,380 records after cleaning and deduplication, down from 382,664 raw collected records.*
¹ *License pending verification — see [Data Sources and Licenses](#data-sources-and-licenses) for full details.*
**Breakdown by Type:**
* **Monolingual Data:** 44,699 records
* **Parallel Data:** 110,681 records
### Splits
| Split | Records | Monolingual | Parallel |
|------------|---------|-------------|----------|
| Train | 124,303 | 35,759 | 88,544 |
| Validation | 15,537 | 4,469 | 11,068 |
| Test | 15,540 | 4,471 | 11,069 |
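The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the numbers from the Splits table (nothing is read from the dataset itself) and confirms that the per-split counts add up to the stated totals, with the splits close to an 80/10/10 ratio. The variable names are illustrative only.

```python
# Counts copied from the Splits table: (monolingual, parallel) per split.
splits = {
    "train": (35_759, 88_544),
    "validation": (4_469, 11_068),
    "test": (4_471, 11_069),
}

# Records per split and per type.
split_totals = {name: mono + par for name, (mono, par) in splits.items()}
mono_total = sum(mono for mono, _ in splits.values())
par_total = sum(par for _, par in splits.values())
grand_total = mono_total + par_total

# Share of the full dataset held by each split, as a percentage.
split_share = {
    name: round(100 * count / grand_total, 2)
    for name, count in split_totals.items()
}

print(split_totals)
print(mono_total, par_total, grand_total)
print(split_share)
```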
## 🧩 Data Formats
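Each record follows one of the two schemas below, shown here as JSON Lines (JSONL) entries. The configs in the card metadata point to Parquet files (`monolingual.parquet`, `parallel.parquet`).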
### Monolingual Records
Xhosa-only texts primarily designed for pre-training and self-supervised learning.
```json
{
  "id": "wiki_42_3",
  "text": "Umntu ngumntu ngabantu.",
  "source": "wikipedia_xh",
  "type": "monolingual",
  "domain": "general",
  "license": "CC-BY-SA"
}
```
### Parallel Records
Aligned Xhosa and English sentence pairs, ideal for translation models and cross-lingual transfer learning.
```json
{
  "id": "opus_12345",
  "xhosa": "Umntu ngumntu ngabantu.",
  "english": "A person is a person through other people.",
  "source": "opus100",
  "type": "parallel",
  "domain": "general",
  "license": "CC-BY"
}
```
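When reading records line by line, it can help to check each one against the fields shown in the two examples. The following is a minimal sketch, assuming that every record carries exactly the fields listed in those examples; `validate_record` and `REQUIRED_FIELDS` are hypothetical names and are not part of the published pipeline.

```python
import json

# Field names taken from the two example records above.
COMMON_FIELDS = {"id", "source", "type", "domain", "license"}
REQUIRED_FIELDS = {
    "monolingual": COMMON_FIELDS | {"text"},
    "parallel": COMMON_FIELDS | {"xhosa", "english"},
}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it has the fields for its type."""
    record = json.loads(line)
    record_type = record.get("type")
    if record_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown record type: {record_type!r}")
    missing = REQUIRED_FIELDS[record_type] - set(record)
    if missing:
        raise ValueError(f"record {record.get('id')!r} is missing {sorted(missing)}")
    return record


sample = (
    '{"id": "opus_12345", "xhosa": "Umntu ngumntu ngabantu.", '
    '"english": "A person is a person through other people.", '
    '"source": "opus100", "type": "parallel", "domain": "general", '
    '"license": "CC-BY"}'
)
record = validate_record(sample)
print(record["xhosa"], "->", record["english"])
```

A record that lacks a required field, or whose `type` is neither `monolingual` nor `parallel`, raises a `ValueError` instead of passing silently into training data.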
## 🚀 Installation & Usage
You can load and use this dataset directly via the [Hugging Face `datasets`](https://huggingface.co/docs/datasets/) library in Python.
First, ensure you have the required library installed:
```bash
pip install datasets
```
Then, load the dataset in your Python environment:
```python
from datasets import load_dataset
# Load the train split of the monolingual config
monolingual_ds = load_dataset("silvanosolutions/xhosa-nlp-dataset", "monolingual", split="train")
print(monolingual_ds[0])

# Load the train split of the parallel config
parallel_ds = load_dataset("silvanosolutions/xhosa-nlp-dataset", "parallel", split="train")
print(parallel_ds[0])
```
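Translation training code usually expects explicit source and target texts rather than `xhosa` and `english` columns. The sketch below shows one possible way to reshape a parallel record for either direction. The function name `to_translation_pair` and the output keys `src_text` and `tgt_text` are assumptions made for illustration; they are chosen so they do not collide with the record's existing `source` field.

```python
def to_translation_pair(record: dict, source_lang: str = "xh") -> dict:
    """Reshape one parallel record into a source/target text pair.

    `source_lang` is "xh" for Xhosa-to-English or "en" for English-to-Xhosa.
    """
    texts = {"xh": record["xhosa"], "en": record["english"]}
    if source_lang not in texts:
        raise ValueError("source_lang must be 'xh' or 'en'")
    target_lang = "en" if source_lang == "xh" else "xh"
    return {"src_text": texts[source_lang], "tgt_text": texts[target_lang]}


# Example record copied from the Parallel Records section.
example = {
    "id": "opus_12345",
    "xhosa": "Umntu ngumntu ngabantu.",
    "english": "A person is a person through other people.",
    "source": "opus100",
    "type": "parallel",
    "domain": "general",
    "license": "CC-BY",
}
pair = to_translation_pair(example)
reverse_pair = to_translation_pair(example, source_lang="en")
print(pair)
print(reverse_pair)
```

With the `parallel_ds` object loaded earlier, the same function could be applied to every row through `parallel_ds.map(to_translation_pair)`, which adds the two new columns alongside the existing ones.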
## 🏗️ Project Structure
The underlying pipeline (Python 3.13) that collects, cleans, and builds this dataset is organized as follows:
```text
.
├── scrapers/
│   ├── __init__.py
│   ├── utils.py                # Shared utilities: DIRS, HEADERS, log_section
│   ├── oscar_scraper.py        # SOURCE 1: CC-100/Glot500 monolingual
│   ├── mafand_scraper.py       # SOURCE 2: OPUS-100 parallel
│   ├── autshumato_scraper.py   # SOURCE 3: Autshumato government parallel
│   ├── wikipedia_scrapper.py   # SOURCE 4: isiXhosa Wikipedia
│   ├── government_scrapper.py  # SOURCE 5: MasakhaNews
│   └── run_all.py              # Entry point to run all scrapers
├── cleaning/
│   ├── __init__.py
│   ├── utils.py
│   ├── language_verifier.py
│   ├── deduplicator.py
│   ├── domain_tagger.py
│   ├── packager.py
│   └── run_cleaning.py
└── data/
    ├── raw/      # Raw collected data per source
    ├── cleaned/  # Deduplicated and quality-filtered data
    └── final/    # Packaged dataset ready for HF upload
```
Libraries used in processing: `datasets`, `beautifulsoup4`, `pandas`, `langdetect`, `ftfy`, `requests`.
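The contents of `deduplicator.py` are not shown in this card, so the following is only a sketch of how a deduplication step of this kind is commonly written, not the project's actual implementation. It uses the Python standard library alone: text is normalized (Unicode NFKC, collapsed whitespace, case-folded) and hashed, and only the first record per hash is kept. The function names are hypothetical, and the encoding repair and language checks that the pipeline performs with `ftfy` and `langdetect` are not reproduced here.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize Unicode, collapse whitespace, and case-fold for comparison."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).casefold()


def deduplicate(records: list[dict], key: str = "text") -> list[dict]:
    """Keep the first record for each distinct normalized value of `key`."""
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


# Illustrative input: "a" and "b" differ only in case and spacing.
raw = [
    {"id": "a", "text": "Umntu ngumntu ngabantu."},
    {"id": "b", "text": "umntu  ngumntu ngabantu. "},
    {"id": "c", "text": "Molo, unjani?"},
]
kept_ids = [record["id"] for record in deduplicate(raw)]
print(kept_ids)
```

Hashing the normalized string rather than storing it keeps memory use predictable for long texts. For parallel data, the same function could be called with `key="xhosa"`, although deduplicating on one side of a pair is a design choice that may discard legitimate alternative translations.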
## ⚖️ Data Sources and Licenses
Because this dataset aggregates several underlying corpora, each record retains the license of its original source.
| Dataset Source | Original License |
| :--- | :--- |
| **Glot500** | Apache-2.0 |
| **OPUS-100** | CC-BY |
| **Autshumato** | CC-BY |
| **Wikipedia** | CC-BY-SA |
| **MasakhaNews** | CC-BY |
| **XhosaNavy (Stellenbosch University)** | ⚠️ Pending verification |
> ⚠️ **XhosaNavy License Notice**
> The XhosaNavy corpus was sourced from
> [OPUS](https://opus.nlpl.eu/datasets/XhosaNavy)
> and originated from research at Stellenbosch
> University (Herman Engelbrecht, Dept. of E&E
> Engineering). OPUS explicitly states it does
> not own the source text and cannot guarantee
> redistribution rights. License confirmation
> for commercial redistribution is currently
> being sought from the original author.
> **Commercial users should contact the
> maintainer before using records where
> `source == "xhosanavey"`.**
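For users who want to set the affected records aside until the license is confirmed, a filter on the `source` field is one option. The sketch below is an illustration under stated assumptions: the source value is copied verbatim from the notice above and should be confirmed against the values that actually appear in the data, the second sample record is invented for the example, and `is_cleared` is a hypothetical helper name.

```python
# Source value copied verbatim from the license notice; verify it against the
# values present in the dataset's `source` column before relying on it.
RESTRICTED_SOURCES = {"xhosanavey"}


def is_cleared(record: dict) -> bool:
    """Return True when the record does not come from a restricted source."""
    return record["source"] not in RESTRICTED_SOURCES


records = [
    {"id": "opus_12345", "source": "opus100", "license": "CC-BY"},
    # Hypothetical record, invented to show the filter in action.
    {"id": "navy_1", "source": "xhosanavey", "license": "pending"},
]
cleared = [record for record in records if is_cleared(record)]
print([record["id"] for record in cleared])
```

On a dataset loaded with the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `parallel_ds.filter(is_cleared)`.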
## 🎯 Intended Use Cases
This dataset is designed specifically for:
1. **Language Modeling:** Training or continuing pre-training of Xhosa language models.
2. **Multilingual LLMs:** Fine-tuning multilingual models (e.g., AfroXLMR, AfriBERTa) to improve Xhosa comprehension.
3. **Machine Translation:** Building high-fidelity Xhosa-English and English-Xhosa translation systems.
4. **Sentiment Analysis:** Training commercial sentiment classifiers and customer feedback analyzers in Xhosa.
5. **Named Entity Recognition:** Teaching systems to correctly identify entities in Xhosa text.
6. **Commercial African Tech:** Providing training data for products targeting Xhosa speakers in the South African and broader African markets.
## 🤝 How to Contribute
We welcome contributions from researchers and developers! If you have scripts for scraping additional Xhosa data, notice data quality issues, or want to contribute new parallel sets, please:
1. Fork the repository.
2. Set up your Python 3.13 environment.
3. Add or update scraper files inside the `/scrapers` directory using existing shared utilities.
4. Ensure text encoding fixes and language verification steps are applied.
5. Open a Pull Request detailing your additions or fixes.
## 📝 Citation
If you use this dataset in a research publication or project, please cite it using the following format:
```bibtex
@dataset{xhosa_nlp_dataset_2026,
  author = {Ntsika Silvano},
  title = {Xhosa NLP Dataset: A Comprehensive IsiXhosa Text Corpus},
  year = {2026},
  version = {1.0.0},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/silvanosolutions/xhosa-nlp-dataset}},
}
```
*(Please also cite the original source datasets where applicable: OPUS-100, MasakhaNews, Autshumato, Wikipedia, Glot500, and XhosaNavy.)*
## 📜 License
### Code License
The aggregation and compilation code (`/scrapers`), utilities, and pipelines in this repository are licensed under the **MIT License**.
### Data Licenses
The processed data retains the original license of each source, as listed in the [Data Sources and Licenses](#data-sources-and-licenses) section.
### Commercial Licensing
For commercial dataset access or alternative licensing agreements, please refer to the Contact section below.
## 📬 Contact & Maintainers
For commercial licensing inquiries, usage questions, or partnership opportunities, please contact the maintainer.
* **Maintainer:** Ntsika Silvano
* **Issues:** Please open a GitHub issue if you spot data anomalies or bugs.
Provider: silvanosolutions



