
SAHAAYAK 2023

Source: arXiv. Updated: 2023-06-27. Indexed: 2024-06-21.
Download link:
https://rb.gy/hf6bp
Description:
SAHAAYAK 2023 is a large-scale, multi-domain bilingual parallel corpus for Sanskrit-to-Hindi machine translation, created by the Asha M. Tarsadia Institute of Computer Science and Technology. It contains 1.58 million parallel sentence pairs spanning news, daily conversation, politics, history, sports, and ancient Indian literature. The corpus was built with a multi-faceted approach, ranging from hand-crafted small datasets to large-scale data mining, cleaning, and validation. It targets the challenges of machine translation for low-resource languages, Sanskrit and Hindi in particular, and is intended to support education and other societal needs.
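
The description mentions cleaning and validation of mined sentence pairs but gives no details. As a rough illustration only, the sketch below shows a generic cleaning pass that is common for parallel corpora: Unicode normalization, dropping empty pairs, a length-ratio filter, and exact-duplicate removal. The function name `clean_pairs`, the `max_ratio` threshold, and the choice of steps are assumptions made for this example and are not taken from the SAHAAYAK 2023 pipeline.

```python
import unicodedata


def clean_pairs(pairs, max_ratio=3.0):
    """Return a cleaned list of (source, target) sentence pairs.

    Hypothetical helper: the steps and the max_ratio default are
    illustrative assumptions, not the corpus authors' method.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Normalize to NFC so visually identical Devanagari strings compare equal.
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        # Drop pairs where either side is empty after trimming.
        if not src or not tgt:
            continue
        # Drop pairs whose character lengths differ too much (likely misaligned).
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

A character-length ratio is a crude proxy for alignment quality; a real pipeline would more likely combine it with language identification and manual spot checks.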
Provider:
Asha M. Tarsadia Institute of Computer Science and Technology, Uka Tarsadia University
Created:
2023-06-27