AFROMT
Source: arXiv · Updated 2021-09-10 · Indexed 2024-06-21
Download link:
https://github.com/machelreid/afromt
Resource description:
AFROMT is a standardized, clean, and reproducible machine translation benchmark for eight African languages, developed by researchers at the University of Tokyo and Carnegie Mellon University. It covers translation between English and Afrikaans, Xhosa, Zulu, Rundi, Sesotho, Swahili, Bemba, and Lingala, which together have 225 million native and second-language speakers. The dataset was built by collecting parallel data from open sources such as OPUS and ParaCrawl, then cleaning and standardizing it with automatic filtering and manual verification. It is intended mainly for research on machine translation for low-resource languages, with the aim of addressing the scarcity of digitized text in these languages and advancing the development and use of machine translation for them.
Provided by:
University of Tokyo, Carnegie Mellon University
Created:
2021-09-10



