
Sinhala-English Parallel Word Dictionary Dataset

arXiv | Updated 2023-08-04 | Indexed 2024-06-21
Download link:
https://github.com/kasunw22/sinhala-para-dict
Resource overview:
This study introduces three English-Sinhala parallel dictionary datasets: En-Si-dict-large, En-Si-dict-filtered, and En-Si-dict-FastText. They are designed to support multilingual natural language processing (NLP) tasks between English and Sinhala. The datasets were created by Kasun Wickramasinghe and Nisansa de Silva of the Department of Computer Science and Engineering, University of Moratuwa, and contain 546,156 word pairs. They were built using the FastText model and the Google Translate API. Each side of every pair is a single word, which makes the datasets suitable for word-level multilingual tasks such as dictionary induction and supervised word embedding alignment. They provide an important foundational resource for Sinhala, a low-resource language, and help advance NLP research for it.
Provided by:
Department of Computer Science and Engineering, University of Moratuwa
Created:
2023-08-04
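
The overview states that each side of every word pair is a single word. As a rough illustration of what that constraint means in practice, the sketch below parses dictionary lines and keeps only single-word pairs. It assumes a tab-separated `english<TAB>sinhala` layout, which is a guess and may not match the files in the repository. The function names and the sample entries are hypothetical and are not taken from the datasets.

```python
from collections import defaultdict


def load_single_word_pairs(lines, sep="\t"):
    """Parse 'english<sep>sinhala' lines, keeping only single-word pairs.

    The separator and column order are assumptions for illustration.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(sep)
        if len(parts) != 2:
            continue  # skip rows that do not have exactly two columns
        en, si = parts[0].strip(), parts[1].strip()
        # Keep a pair only if each side is exactly one whitespace-free token.
        if len(en.split()) == 1 and len(si.split()) == 1:
            pairs.append((en.lower(), si))
    return pairs


def build_lookup(pairs):
    """Group Sinhala candidates by English word, preserving first-seen order."""
    lookup = defaultdict(list)
    for en, si in pairs:
        if si not in lookup[en]:
            lookup[en].append(si)
    return dict(lookup)


# Illustrative entries only; not copied from the datasets.
sample = [
    "water\tවතුර",
    "book\tපොත",
    "good morning\tසුබ උදෑසනක්",  # multi-word entry, filtered out
    "water\tජලය",
]
pairs = load_single_word_pairs(sample)
lookup = build_lookup(pairs)
```

A lookup of this shape, mapping one English word to a list of Sinhala candidates, is one convenient form for the seed dictionaries used in dictionary induction and supervised embedding alignment, since a single source word often has several valid translations.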