SimbaMaw1547/south-african-monolingual-corpora-jsonl

Source: Hugging Face. Updated 2025-08-26; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/SimbaMaw1547/south-african-monolingual-corpora-jsonl
# South African Languages Pretraining Dataset

This dataset contains pretraining text data for 9 South African languages, compiled from multiple sources including CC100, Glot500, mC4, ParaCrawl, and various corpora collections. The datasets were gathered as part of the University of Cape Town's SALLM project. Where data was gathered from multiple sources, extensive filtering and deduplication was conducted to ensure dataset integrity.

## Languages Included

| Language Code | Language Name | Available Datasets |
|---------------|---------------|-------------------|
| `nbl` | Southern Ndebele | corpora |
| `nso` | Northern Sotho (Sepedi) | cc100, corpora, glot500, paracrawl |
| `sot` | Southern Sotho (Sesotho) | corpora, glot500, mc4 |
| `ssw` | Swati (Siswati) | cc100, corpora, glot500, paracrawl |
| `tsn` | Tswana (Setswana) | cc100, corpora, glot500, paracrawl |
| `tso` | Tsonga (Xitsonga) | corpora, glot500, paracrawl |
| `ven` | Venda (Tshivenda) | corpora, glot500 |
| `xho` | Xhosa (isiXhosa) | cc100, corpora, glot500, inkuba, mc4, paracrawl, wura |
| `zul` | Zulu (isiZulu) | cc100, corpora, glot500, inkuba, mc4, paracrawl, wura |

## Dataset Structure

```
dataset/
├── nbl/
│   └── train.jsonl
├── nso/
│   └── train.jsonl
├── sot/
│   └── train.jsonl
├── ssw/
│   └── train.jsonl
├── tsn/
│   └── train.jsonl
├── tso/
│   └── train.jsonl
├── ven/
│   └── train.jsonl
├── xho/
│   └── train.jsonl
└── zul/
    └── train.jsonl
```

Each `train.jsonl` file contains the concatenated data from all available source datasets for that language. The data is stored in JSON Lines format with one JSON object per line.

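As a rough illustration of the layout above, the following sketch walks a local copy of the directory tree and counts the JSON objects in each `train.jsonl`. It assumes the files have already been downloaded under a local root directory; the helper name `count_records` and the root path are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path

# Language codes taken from the "Languages Included" table.
LANGUAGES = ["nbl", "nso", "sot", "ssw", "tsn", "tso", "ven", "xho", "zul"]


def count_records(root):
    """Count JSON objects in each <root>/<lang>/train.jsonl that exists locally.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    counts = {}
    for lang in LANGUAGES:
        path = Path(root) / lang / "train.jsonl"
        if not path.is_file():
            continue  # this language has not been downloaded
        n = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    json.loads(line)  # raises ValueError on a malformed line
                    n += 1
        counts[lang] = n
    return counts
```

Calling `count_records("dataset")` on a complete local copy would return one entry per language directory; languages that are missing locally are simply left out of the result.
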
## Data Format

Each line in the JSONL files contains a JSON object with text data:

```json
{"text": "Sample text in the target language..."}
```

## Source Datasets

- **CC100**: CommonCrawl-based multilingual dataset
- **Corpora**: Various text corpora collections
- **Glot500**: Multilingual dataset covering 500+ languages
- **Inkuba**: South African language corpus (available for Xhosa and Zulu)
- **mC4**: Multilingual Colossal Clean Crawled Corpus
- **ParaCrawl**: Parallel corpus extracted from web crawls
- **Wura**: West and Central African language corpus (available for Xhosa and Zulu)

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

REPO_ID = "SimbaMaw1547/south-african-monolingual-corpora-jsonl"

# Load a specific language
dataset = load_dataset(REPO_ID, data_dir="zul", split="train")

# Load all languages, one data_dir at a time, keyed by language code.
# (load_dataset without a split returns a dict keyed by split, not by language.)
languages = ["nbl", "nso", "sot", "ssw", "tsn", "tso", "ven", "xho", "zul"]
all_languages = {lang: load_dataset(REPO_ID, data_dir=lang, split="train") for lang in languages}

# Access specific language data
zulu_data = all_languages["zul"]
xhosa_data = all_languages["xho"]
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{south_african_languages_pretraining,
  title={South African Languages Pretraining Dataset},
  author={[Simbarashe Mawere]},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/SimbaMaw1547/south-african-monolingual-corpora-jsonl}
}
```

## License

Please refer to the individual source dataset licenses:

- [CC100](https://commoncrawl.org/terms-of-use)
- Glot500: see the Glot500 dataset card
- mC4: Apache 2.0
- ParaCrawl: CC0
- Corpora: [Various sources]
- [Inkuba](https://huggingface.co/datasets/lelapa/Inkuba-Mono)
- [Wura](https://huggingface.co/datasets/castorini/wura)

## Data Quality and Preprocessing

The data has been processed and concatenated from the original sources.
Users should be aware that:

- Text quality may vary across different sources
- Some datasets may contain noise or irrelevant content
- Deduplication may be needed depending on your use case
- Language detection accuracy may vary

## Contributions

If you find issues with the data or have suggestions for improvements, please open an issue in the dataset repository.

## Acknowledgments

We acknowledge the creators and contributors of the source datasets:

- CommonCrawl for CC100
- The CIS team at Ludwig Maximilian University for Glot500
- mC4 team at Google
- ParaCrawl project
- Lelapa for Inkuba dataset
- Contributors to Wura including Masakhane
- Various corpora contributors

This work contributes to the preservation and computational accessibility of South African languages.
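
For the caveat under Data Quality and Preprocessing that further deduplication may be needed, a minimal sketch of exact-match deduplication over JSONL lines is given here. It is not the filtering procedure used to build the dataset, which is not described; the function name `dedupe_jsonl_lines` is hypothetical, and the only assumption about the data is the documented `{"text": ...}` record shape. Near-duplicates that differ by more than whitespace are not caught.

```python
import hashlib
import json


def dedupe_jsonl_lines(lines):
    """Yield parsed records whose whitespace-normalised text was not seen before.

    Hypothetical helper: exact-match deduplication only.
    """
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Collapse whitespace runs so trivially different copies hash the same.
        normalised = " ".join(record["text"].split())
        digest = hashlib.sha256(normalised.encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate of an earlier record
        seen.add(digest)
        yield record
```

Because the function is a generator over any iterable of lines, it can be fed an open file handle directly, for example `dedupe_jsonl_lines(open("zul/train.jsonl", encoding="utf-8"))`, at the cost of holding one 32-byte digest per unique record in memory.
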

Provider: SimbaMaw1547