
The South African Gov-ZA multilingual corpus

Source: NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://zenodo.org/record/7635167
Resource description:
GitHub: https://github.com/dsfsi/gov-za-multilingual

Zenodo:

About Dataset
---------------------

The dataset contains cabinet statements from the South African government. The data was scraped from the government's website: https://www.gov.za/cabinet-statements

The dataset contains government cabinet statements in 11 languages, namely:

| Language   | Code  | Language  | Code  |
|------------|-------|-----------|-------|
| English    | (eng) | Sepedi    | (nso) |
| Afrikaans  | (afr) | Setswana  | (tsn) |
| isiNdebele | (nbl) | Siswati   | (ssw) |
| isiXhosa   | (xho) | Tshivenda | (ven) |
| isiZulu    | (zul) | Xitsonga  | (tso) |
| Sesotho    | (sot) |           |       |

The dataset contains the full data in a JSON file (/data/govza-cabinet-statements.json), as well as CSVs split by language, e.g. "govza-cabinet-statements-en.csv" for English. The dataset does not contain special characters like unicode or ascii.

Please see the [data-statement.md](/data_statement.md) for full dataset information.
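The file layout described above (one combined JSON file plus one CSV per language) could be read with a minimal sketch like the following. It uses only the Python standard library. The function names `load_statements` and `load_language_rows` are hypothetical, the JSON structure and CSV column names are not documented here and are left uninspected, and only the `en` suffix is confirmed by the README's example; suffixes for the other languages are an assumption.

```python
import csv
import json
from pathlib import Path

# File-name pattern taken from the README example
# ("govza-cabinet-statements-en.csv"); other suffixes are assumptions.
CSV_PATTERN = "govza-cabinet-statements-{suffix}.csv"


def load_statements(json_path):
    """Load the combined JSON file and return whatever structure it holds."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def load_language_rows(data_dir, suffix):
    """Load one per-language CSV as a list of dicts keyed by column header."""
    csv_path = Path(data_dir) / CSV_PATTERN.format(suffix=suffix)
    with open(csv_path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

With a local clone of the repository, `load_statements("data/govza-cabinet-statements.json")` would read the full dataset and `load_language_rows("data", "en")` the English split, assuming the CSVs sit in the same `data` directory.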
*(TODO)* Number of Aligned Pairs with Cosine Similarity Score >= 0.65
------------------------------------------------------------

| src_lang | trg_lang | num_aligned_pairs |
|----------|----------|-------------------|
| afr | eng | 14549 |
| afr | nbl | 6621 |
| afr | nso | 15388 |
| afr | sot | 8834 |
| afr | ssw | 15610 |
| afr | tsn | 12605 |
| afr | tso | 14936 |
| afr | ven | 5776 |
| afr | xho | 16065 |
| afr | zul | 14998 |
| nbl | eng | 3616 |
| nbl | nso | 6342 |
| nbl | sot | 16163 |
| nbl | ssw | 4655 |
| nbl | tsn | 3369 |
| nbl | tso | 4465 |
| nbl | ven | 18984 |
| nbl | xho | 5213 |
| nbl | zul | 3868 |
| nso | eng | 15257 |
| nso | ssw | 18697 |
| nso | tsn | 16179 |
| nso | tso | 17617 |
| nso | ven | 6367 |
| sot | eng | 5212 |
| sot | nso | 8077 |
| sot | ssw | 5811 |
| sot | tsn | 5450 |
| sot | tso | 6586 |
| sot | ven | 14098 |
| ssw | eng | 15721 |
| ssw | tso | 17880 |
| ssw | ven | 4588 |
| tsn | eng | 14544 |
| tsn | ssw | 16386 |
| tsn | tso | 16681 |
| tsn | ven | 3267 |
| tso | eng | 16068 |
| ven | eng | 3670 |
| ven | tso | 4578 |
| xho | eng | 16537 |
| xho | nso | 18110 |
| xho | sot | 7489 |
| xho | ssw | 18387 |
| xho | tsn | 16571 |
| xho | tso | 17954 |
| xho | ven | 4559 |
| xho | zul | 18145 |
| zul | eng | 16149 |
| zul | nso | 17630 |
| zul | sot | 5975 |
| zul | ssw | 18563 |
| zul | tsn | 16482 |
| zul | tso | 17789 |
| zul | ven | 3606 |

Authors
-------

- Vukosi Marivate - [@vukosi](https://twitter.com/vukosi)
- Matimba Shingange
- Richard Lastrucci
- Isheanesu Joseph Dzingirai
- Jenalea Rajab

Publications
-------

> @inproceedings{lastrucci-etal-2023-preparing,
>     title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
>     author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
>     booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
>     month = may,
>     year = "2023",
>     address = "Dubrovnik, Croatia",
>     publisher = "Association for Computational Linguistics",
>     url = "https://aclanthology.org/2023.rail-1.3",
>     pages = "18--25"
> }
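The aligned-pairs counts reported in this README use a cosine similarity cutoff of 0.65. As a rough illustration of what such a cutoff does, the sketch below scores candidate sentence pairs by the cosine similarity of their vectors and keeps those at or above the threshold. This README does not say how the sentence vectors were produced, so the vectors here are placeholders supplied by the caller, and `cosine_similarity` and `filter_aligned_pairs` are hypothetical helper names rather than the authors' code.

```python
import math

THRESHOLD = 0.65  # cutoff stated in the aligned-pairs heading


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_aligned_pairs(candidates, threshold=THRESHOLD):
    """Keep (src, trg, score) for pairs scoring at or above the threshold.

    `candidates` is an iterable of (src_sentence, src_vec, trg_sentence, trg_vec);
    how the vectors are computed is outside the scope of this sketch.
    """
    kept = []
    for src, src_vec, trg, trg_vec in candidates:
        score = cosine_similarity(src_vec, trg_vec)
        if score >= threshold:
            kept.append((src, trg, score))
    return kept
```

Counting the tuples returned for one language pair would give a figure comparable in kind to the `num_aligned_pairs` column, although the actual numbers depend on the embedding model and candidate generation, which are not specified here.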
Created:
2023-07-06