dsfsi/govza-sa-cabinet-statements-sentence-aligned
Source: Hugging Face | Updated: 2026-01-01 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/dsfsi/govza-sa-cabinet-statements-sentence-aligned
Description:
This dataset contains sentence-aligned parallel text from South African government cabinet statements in 11 official languages. The data originates from the Government Communication and Information System (GCIS) and was scraped from www.gov.za/cabinet-statements. Key features: 55 language-pair combinations (every unordered pair of the 11 languages), sentence-level alignment using LASER embeddings, alignment confidence scores (cosine similarity), train/test/eval splits for each language pair, and coverage of low-resource African languages. The dataset supports translation tasks between South African languages, including Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sepedi (Northern Sotho), Setswana, Siswati, Tshivenda, and Xitsonga.
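The pair count and the confidence score mentioned in the description can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the dataset's actual pipeline or schema: the row keys `src`, `tgt`, and `score`, the helper `filter_by_score`, and the threshold `0.8` are hypothetical, and the two-dimensional vectors are toy stand-ins for LASER sentence embeddings. The description names ten languages; Sesotho is added here as the eleventh official language.

```python
from itertools import combinations
import math

# The 11 official languages. The description lists ten by name;
# Sesotho is filled in here as the eleventh (assumption of this sketch).
LANGUAGES = [
    "Afrikaans", "English", "isiNdebele", "isiXhosa", "isiZulu", "Sepedi",
    "Sesotho", "Setswana", "Siswati", "Tshivenda", "Xitsonga",
]

# Every unordered pair of distinct languages: C(11, 2) = 55.
language_pairs = list(combinations(LANGUAGES, 2))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def filter_by_score(rows, threshold):
    """Keep rows whose alignment score meets the threshold.

    The "score" key is a hypothetical field name, not the real schema.
    """
    return [row for row in rows if row["score"] >= threshold]


# Toy rows: 2-d vectors stand in for real sentence embeddings.
toy_rows = [
    {"src": "sentence A", "tgt": "sentence A'",
     "score": cosine_similarity([1.0, 0.0], [1.0, 0.0])},
    {"src": "sentence B", "tgt": "sentence B'",
     "score": cosine_similarity([1.0, 0.0], [0.0, 1.0])},
]

if __name__ == "__main__":
    print(len(language_pairs))                  # number of language pairs
    print(len(filter_by_score(toy_rows, 0.8)))  # rows kept at threshold 0.8
```

Filtering on the score trades corpus size for alignment quality; a suitable threshold depends on the language pair and would need to be chosen by inspecting the data.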
Provider:
dsfsi



