Autshumato parallel corpora

Name: Autshumato parallel corpora
Creator: Explore Data Science Academy
Published: 2019-06-18 02:47:28
License: No description available

arXiv | Updated 2019-06-18 | Indexed 2024-06-21

Download link:

https://repo.sadilar.org/handle/20.500.12185/404

Resource description:

The Autshumato parallel corpora are built from South African government data and cover five language pairs: English paired with Afrikaans, isiZulu, Northern Sotho, Setswana, and Xitsonga. The dataset contains 53,172 parallel sentence pairs, drawn mainly from South African government documents, and is intended to support machine translation research for African languages. Its creation involved collecting, cleaning, and deduplicating the data to improve quality. It is used mainly in machine translation and language technology research on African languages, where it helps address resource scarcity and the limited discoverability of language resources.
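The cleaning and deduplication step mentioned above can be illustrated with a minimal sketch. This is not the dataset creators' actual pipeline, which the description does not detail; the function names `normalize` and `clean_and_dedupe` are hypothetical, and the sketch assumes the corpus has already been loaded as (source, target) string pairs.

```python
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return " ".join(text.split())


def clean_and_dedupe(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop pairs with an empty side and exact duplicates, keeping first-seen order.

    Assumption: a duplicate is a pair whose normalized source AND target
    both match an earlier pair. Real pipelines may use looser criteria.
    """
    seen = set()
    result = []
    for src, tgt in pairs:
        pair = (normalize(src), normalize(tgt))
        if not pair[0] or not pair[1]:
            continue  # skip pairs where one side is empty after cleaning
        if pair in seen:
            continue  # skip exact duplicates
        seen.add(pair)
        result.append(pair)
    return result
```

Matching on the full (source, target) pair rather than on the source alone keeps legitimate alternative translations of the same English sentence, which may or may not be what a given project wants.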

Provider: Explore Data Science Academy

Created: 2019-06-18
