SAHAAYAK 2023
Source: arXiv; updated 2023-06-27; indexed 2024-06-21
Download link:
https://rb.gy/hf6bp
Overview:
SAHAAYAK 2023 is a large-scale, multi-domain bilingual parallel corpus for Sanskrit-to-Hindi machine translation, created by the Asha M. Tarsadia Institute of Computer Science and Technology. It contains 1.58 million parallel sentence pairs spanning news, daily conversation, politics, history, sports, and ancient Indian literature. The corpus was built with a multi-faceted approach that ranged from hand-crafting small datasets to extensive data mining, cleaning, and validation. It targets the challenges of machine translation for low-resource languages, Sanskrit and Hindi in particular, and is expected to be useful in education and for a range of societal needs.
Provider:
Asha M. Tarsadia Institute of Computer Science and Technology, Uka Tarsadia University
Created:
2023-06-27



