Autshumato parallel corpora
Source: arXiv. Updated: 2019-06-18. Indexed: 2024-06-21.
Download link:
https://repo.sadilar.org/handle/20.500.12185/404
Description:
Autshumato Parallel Corpora is a set of parallel corpora built from South African government data, covering five language pairs: English paired with Afrikaans, isiZulu, Northern Sotho, Setswana, and Xitsonga. The dataset contains 53,172 parallel sentence pairs, drawn mainly from South African government documents, and is intended to support machine translation research for African languages. Its creation involved collecting, cleaning, and deduplicating the data to improve quality. Its main applications are in machine translation and language technology, particularly research on African languages, where it helps address resource scarcity and the limited discoverability of language resources.
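The cleaning and deduplication steps mentioned above can be illustrated with a minimal sketch. This is not the procedure used to build the dataset, which the description does not specify. The function names (normalize, clean_and_dedupe), the choice of NFC normalization, the case-insensitive duplicate check, and the sample sentences are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_and_dedupe(pairs):
    """Return cleaned (source, target) pairs, dropping empty and duplicate pairs.

    Hypothetical helper: a pair counts as a duplicate when both sides match
    an earlier pair after normalization, ignoring case.
    """
    seen = set()
    result = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # skip pairs with a missing side
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # skip pairs already kept
        seen.add(key)
        result.append((src, tgt))
    return result


# Illustrative English-Afrikaans pairs, not taken from the corpus.
sample = [
    ("Good morning.", "Goeie môre."),
    ("Good  morning. ", "Goeie môre."),  # duplicate once whitespace is collapsed
    ("", "Dankie."),                     # missing source side
    ("Thank you.", "Dankie."),
]
cleaned = clean_and_dedupe(sample)
```

Deduplicating on the pair rather than on the source sentence alone keeps cases where the same source sentence has more than one valid translation.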
Provided by:
Explore Data Science Academy
Created:
2019-06-18



