Tamil-Sinhala-English Parallel Corpus
Source: arXiv. Updated: 2022-12-16. Indexed: 2024-06-21.
Download link:
https://github.com/aaivu/Tamizhi-Net-OCR
Description:
The Tamil-Sinhala-English Parallel Corpus was created by a research team in the Department of Computer Science and Engineering at the University of Moratuwa, Sri Lanka. It contains 100 parallel documents in Tamil, Sinhala, and English, totalling approximately 2.11 million, 2.22 million, and 2.33 million words respectively. The data comes from official documents of the Parliament of Sri Lanka; the text was extracted with deep learning techniques from PDF files that use legacy fonts. The dataset is intended to support research on machine translation and language interoperability, and to help address the scarcity of data for low-resource languages in natural language processing.
Provided by:
Department of Computer Science and Engineering, University of Moratuwa
Created:
2021-09-13



