SimbaMaw1547/south-african-monolingual-corpora-jsonl
Source: Hugging Face. Updated 2025-08-26; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/SimbaMaw1547/south-african-monolingual-corpora-jsonl
Description:
# South African Languages Pretraining Dataset
This dataset contains pretraining text data for 9 South African languages, compiled from multiple sources including CC100, Glot500, mC4, ParaCrawl, and various corpora collections.
The datasets were gathered as part of the University of Cape Town's SALLM project. Where data was gathered from multiple sources, extensive filtering and deduplication were applied to ensure dataset integrity.
## Languages Included
| Language Code | Language Name | Available Datasets |
|---------------|---------------|-------------------|
| `nbl` | Southern Ndebele | corpora |
| `nso` | Northern Sotho (Sepedi) | cc100, corpora, glot500, paracrawl |
| `sot` | Southern Sotho (Sesotho) | corpora, glot500, mc4 |
| `ssw` | Swati (Siswati) | cc100, corpora, glot500, paracrawl |
| `tsn` | Tswana (Setswana) | cc100, corpora, glot500, paracrawl |
| `tso` | Tsonga (Xitsonga) | corpora, glot500, paracrawl |
| `ven` | Venda (Tshivenda) | corpora, glot500 |
| `xho` | Xhosa (isiXhosa) | cc100, corpora, glot500, inkuba, mc4, paracrawl, wura |
| `zul` | Zulu (isiZulu) | cc100, corpora, glot500, inkuba, mc4, paracrawl, wura |
## Dataset Structure
```
dataset/
├── nbl/
│   └── train.jsonl
├── nso/
│   └── train.jsonl
├── sot/
│   └── train.jsonl
├── ssw/
│   └── train.jsonl
├── tsn/
│   └── train.jsonl
├── tso/
│   └── train.jsonl
├── ven/
│   └── train.jsonl
├── xho/
│   └── train.jsonl
└── zul/
    └── train.jsonl
```
Each `train.jsonl` file contains the concatenated data from all available source datasets for that language. The data is stored in JSON Lines format with one JSON object per line.
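If you have downloaded the repository to disk, the layout above can be walked directly. The following is a minimal sketch, assuming a local copy whose root contains one directory per language code; the function name `find_language_files` is hypothetical and not part of the dataset or of any library.

```python
from pathlib import Path


def find_language_files(root):
    """Map each language code to its train.jsonl under a local dataset root.

    Assumes the layout shown above: <root>/<language code>/train.jsonl.
    Directories without a train.jsonl file are skipped.
    """
    root = Path(root)
    files = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        candidate = lang_dir / "train.jsonl"
        if candidate.is_file():
            files[lang_dir.name] = candidate
    return files
```

Calling `find_language_files("dataset")` on a complete local copy should return one entry per language directory, keyed by codes such as `zul` and `xho`.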
## Data Format
Each line in the JSONL files contains a JSON object with text data:
```json
{"text": "Sample text in the target language..."}
```
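Because each line is an independent JSON object, a file can be read one record at a time without loading it fully into memory. This is a minimal sketch using only the Python standard library; the helper name `iter_texts` is hypothetical, and the check for a `text` key assumes every record follows the format shown above.

```python
import json


def iter_texts(path):
    """Yield the "text" field of each JSON object in a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "text" not in record:
                raise ValueError(f"line {line_number}: missing 'text' field")
            yield record["text"]
```

For example, `for text in iter_texts("dataset/zul/train.jsonl"): ...` would stream the Zulu records from a local copy.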
## Source Datasets
- **CC100**: CommonCrawl-based multilingual dataset
- **Corpora**: Various text corpora collections
- **Glot500**: Multilingual dataset covering 500+ languages
- **Inkuba**: Monolingual African-language corpus from Lelapa (used here for Xhosa and Zulu)
- **mC4**: Multilingual Colossal Clean Crawled Corpus
- **ParaCrawl**: Parallel corpus extracted from web crawls
- **Wura**: Multilingual corpus of African languages (used here for Xhosa and Zulu)
## Usage
### Loading the Dataset
```python
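from datasets import load_dataset

REPO = "SimbaMaw1547/south-african-monolingual-corpora-jsonl"

# Load a specific language (each language lives in its own directory)
dataset = load_dataset(REPO, data_dir="zul", split="train")

# Load all languages. load_dataset returns splits, not languages,
# so build a dict keyed by language code with one load per directory.
codes = ["nbl", "nso", "sot", "ssw", "tsn", "tso", "ven", "xho", "zul"]
all_languages = {code: load_dataset(REPO, data_dir=code, split="train") for code in codes}

# Access specific language data
zulu_data = all_languages["zul"]
xhosa_data = all_languages["xho"]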
```
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{south_african_languages_pretraining,
title={South African Languages Pretraining Dataset},
author={Simbarashe Mawere},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/datasets/SimbaMaw1547/south-african-monolingual-corpora-jsonl}
}
```
## License
Please refer to the individual source dataset licenses:
- [CC100](https://commoncrawl.org/terms-of-use)
- Glot500: refer to the license stated on the Glot500 dataset card
- mC4: Apache 2.0
- ParaCrawl: CC0
- Corpora: various sources; check the license of each underlying corpus
- [Inkuba](https://huggingface.co/datasets/lelapa/Inkuba-Mono)
- [Wura](https://huggingface.co/datasets/castorini/wura)
## Data Quality and Preprocessing
The data has been processed and concatenated from the original sources. Users should be aware that:
- Text quality may vary across different sources
- Some datasets may contain noise or irrelevant content
- Further deduplication may be needed depending on your use case, since the per-language files concatenate several sources
- Language detection accuracy may vary
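For users who want an extra deduplication pass, exact-match deduplication on normalized text is a simple starting point. The sketch below is an illustration only and is not the filtering pipeline used by the SALLM project; the normalization choices (Unicode NFC, lowercasing, whitespace collapsing), the `min_chars` threshold, and the function names are all assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(text):
    """Normalize to NFC, lowercase, and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def deduplicate(texts, min_chars=1):
    """Yield each text whose normalized form has not been seen before.

    Texts whose normalized form is shorter than min_chars are dropped.
    Only hashes are kept in memory, not the texts themselves.
    """
    seen = set()
    for text in texts:
        key_text = normalize(text)
        if len(key_text) < min_chars:
            continue
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

This only removes exact duplicates after normalization. Near-duplicate detection (for example MinHash) and language identification would need separate tooling.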
## Contributions
If you find issues with the data or have suggestions for improvements, please open an issue in the dataset repository.
## Acknowledgments
We acknowledge the creators and contributors of the source datasets:
- CommonCrawl for CC100
- The CIS team at Ludwig Maximilian University for Glot500
- mC4 team at Google
- ParaCrawl project
- Lelapa for Inkuba dataset
- Contributors to Wura including Masakhane
- Various corpora contributors
This work contributes to the preservation and computational accessibility of South African languages.
Provider:
SimbaMaw1547



