Machine translation for low resource Thai-English-Myanmar language pairs
Source: DataCite Commons. Updated: 2023-09-22. Indexed: 2025-04-16.
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2022.757
Resource description:
With the rapid growth of Internet users around the world, language has become crucial to our ability to access information, education, and communication on the Internet. For people like us who are not native English speakers, the language barrier makes this information difficult to access. Thai and Myanmar both belong to the group of low-resource languages. Most natural language processing (NLP) research has been conducted on high-resource languages, whereas research on low-resource languages has been scarce. To address this research gap, we concentrate on developing NLP research for our low-resource languages. Besides the low-resource data problem, finding an appropriate machine translation (MT) method is challenging for our thesis because the languages differ in nature. When choosing an MT approach for Thai and Myanmar, reversed grammatical structures and syllable-based written forms are key factors to consider. To solve these research problems, our thesis proposes three methods for improving MT performance for low-resource English-Thai-Myanmar language pairs: 1) the EDITOR and Levenshtein transformer (LevT) models, 2) the SwitchOut data augmentation algorithm, and 3) the mBART-50 pre-trained model. As part of our contribution to low-resource MT, we built our own corpus. Our proposed methods were tested on the newly constructed En-My-Th medical corpus and the existing ASEAN-MT corpus. The results of our experiments indicate that contributions from linguistics, such as building a good-quality corpus and using data augmentation, can substantially improve the translation score for low-resource language pairs. In particular, applying the SwitchOut data augmentation algorithm significantly increases the scores for English ⇔ Thai translation, to over 30 BLEU.
English-Myanmar translation with syllable segmentation on the Myanmar side also achieves over 30 BLEU. The SwitchOut approach also yields over 20 BLEU for the Thai ⇔ Myanmar translation pairs, which is likewise a good translation score. Moreover, we found that fine-tuning the pre-trained mBART model, a recent trend in neural machine translation (NMT), also provides satisfactory translation performance for our low-resource English ⇔ Myanmar and English ⇔ Thai language pairs.
Provider:
Thammasat University
Created:
2023-09-22



