Machine translation for low resource Thai-English-Myanmar language pairs

DataCite Commons | Updated 2023-09-22 | Indexed 2025-04-16
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2022.757
Resource description:
With the rapid growth of Internet users around the world, languages have become crucial to our ability to access information, education, and communication on the Internet. For people such as us who are not native English speakers, the language barrier makes this information difficult to access. Thai and Myanmar are both low-resource languages. Most natural language processing (NLP) research has been conducted on high-resource languages, whereas research on low-resource languages has been scarce. To address this research gap, we concentrate on developing NLP research for our low-resource languages. Besides the scarcity of data, finding an appropriate machine translation (MT) method is a challenging task for our thesis because the languages differ in nature. When choosing an MT approach for Thai and Myanmar, reverse grammatical structures and syllable-based written forms are key factors to consider. To solve these research problems, our thesis proposes three methods for improving MT performance for low-resource English-Thai-Myanmar language pairs: 1) the EDITOR and Levenshtein transformer (LevT) models, 2) the SwitchOut data augmentation algorithm, and 3) the mBART-50 pre-trained model. As part of our contribution to low-resource MT, we built our own corpus. Our proposed methods were tested on the newly constructed En-My-Th medical corpus and the existing ASEAN-MT corpus. The results of our experiments indicate that contributions from linguistics, such as building a good-quality corpus and using data augmentation, can substantially improve translation scores for low-resource language pairs. In particular, applying the SwitchOut data augmentation algorithm significantly increases the scores for English ⇔ Thai translation, to over 30 BLEU. English-Myanmar translation with syllable segmentation on the Myanmar side also achieves over 30 BLEU. The SwitchOut approach also yields over 20 BLEU for the Thai ⇔ Myanmar translation pairs, which is also a good translation score. Moreover, we found that fine-tuning the pre-trained mBART model, the latest trending NMT method, can also provide satisfactory translation performance for our low-resource English ⇔ Myanmar and English ⇔ Thai language pairs.
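To make the data augmentation idea concrete, here is a minimal sketch of SwitchOut-style word replacement in Python. It is a simplified illustration, not the thesis implementation: the function name `switchout`, the temperature parameter `tau`, and the toy vocabulary are assumptions introduced here. The sketch samples how many tokens to corrupt with probability proportional to exp(-n / tau), then overwrites that many randomly chosen positions with words drawn uniformly from the vocabulary.

```python
import math
import random


def switchout(tokens, vocab, tau=1.0, rng=None):
    """Return a copy of `tokens` with a sampled number of positions replaced.

    Simplified, hypothetical sketch: `tau` controls how aggressively tokens
    are replaced (smaller tau means fewer replacements on average).
    """
    if rng is None:
        rng = random.Random()
    length = len(tokens)
    if length == 0:
        return []

    # Sample the number of replacements n in [0, length] with
    # probability proportional to exp(-n / tau).
    counts = list(range(length + 1))
    weights = [math.exp(-n / tau) for n in counts]
    n_replace = rng.choices(counts, weights=weights, k=1)[0]

    # Choose n distinct positions and overwrite each with a random vocab word.
    out = list(tokens)
    for pos in rng.sample(range(length), n_replace):
        out[pos] = rng.choice(vocab)
    return out


if __name__ == "__main__":
    sentence = ["the", "patient", "has", "a", "fever"]
    toy_vocab = ["doctor", "cough", "medicine", "hospital"]
    print(switchout(sentence, toy_vocab, tau=1.0, rng=random.Random(0)))
```

In a training loop, a function like this would typically be applied to each sentence of a parallel pair on the fly, so the model sees a slightly different corrupted version of the same pair in every epoch.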
Provider:
Thammasat University
Created:
2023-09-22