Sinhala-English Parallel Word Dictionary Dataset
Source: arXiv. Updated: 2023-08-04. Indexed: 2024-06-21.
Download link:
https://github.com/kasunw22/sinhala-para-dict
Description:
This study introduces three English-Sinhala parallel word dictionary datasets (En-Si-dict-large, En-Si-dict-filtered, and En-Si-dict-FastText) intended to support multilingual natural language processing (NLP) tasks between English and Sinhala. The datasets were created by Kasun Wickramasinghe and Nisansa de Silva of the Department of Computer Science and Engineering, University of Moratuwa, and contain 546,156 word pairs. They were built using FastText models and the Google Translate API. Every pair consists of a single English word and a single Sinhala word, which makes the datasets suitable for word-level multilingual tasks such as dictionary induction and supervised word embedding alignment. Because Sinhala is a low-resource language, these datasets provide an important foundational resource for NLP research on it.
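The single-word constraint described above can be sketched as a small loader. This is only an illustration: the tab-separated "english<TAB>sinhala" line format, the function name load_pairs, and the sample rows are assumptions made for the example, not the documented layout of the files in the repository, so check the actual files before reusing it.

```python
from collections import defaultdict


def load_pairs(lines, sep="\t"):
    """Build an English -> [Sinhala, ...] lookup from 'english<sep>sinhala' lines.

    Assumed (hypothetical) input format: one pair per line, two fields.
    Pairs where either side is empty or has more than one word are dropped,
    mirroring the single-word property described for the datasets.
    """
    lookup = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(sep)
        if len(parts) != 2:
            continue  # skip malformed rows
        en, si = parts[0].strip(), parts[1].strip()
        # Keep only pairs with exactly one whitespace-delimited token per side.
        if len(en.split()) != 1 or len(si.split()) != 1:
            continue
        if si not in lookup[en]:
            lookup[en].append(si)  # preserve order, drop exact duplicates
    return dict(lookup)


# Illustrative rows only; not taken from the datasets.
sample = [
    "water\tජලය",
    "water\tවතුර",
    "book\tපොත",
    "thank you\tස්තුතියි",  # multi-word English side, filtered out
    "",
]
pairs = load_pairs(sample)
print(pairs)
```

A lookup of this shape is a common starting point for word-level tasks: the keys and values can be used as seed pairs when aligning two monolingual embedding spaces, or as a reference when evaluating an induced dictionary.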
Provider:
Department of Computer Science and Engineering, University of Moratuwa
Created:
2023-08-04



