
A dataset of transliterations and pronunciations of ancient Isan scripts

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/cy4c8dwb7t
Description:
This paper presents the first parallel dataset of transliterations and pronunciations of ancient Isan scripts. The dataset is a resource for natural language processing tasks and for supporting the preservation of Isan culture in Thailand. It was constructed from ancient Isan medicine books that were translated from palm-leaf manuscripts of Isan medicinal recipes. The sentences were transliterated from the ancient Isan manuscripts and then rendered into Isan pronunciation. The construction process involved data collection from the Research Institute of Northeastern Arts and Culture, Mahasarakham University, Mahasarakham Province, Thailand, followed by data cleaning and sentence alignment. The dataset consists of sentence pairs, each pairing the transliteration of the ancient Isan script with its corresponding pronunciation, and it includes part-of-speech (POS) annotations. The final dataset contains 4,548 sentence pairs and 214,489 words, annotated with a set of 15 POS tags. The data is stored in CSV format in two files: Isan_Medicine_Corpus_sentence.csv, which contains the parallel sentences of transliterations and pronunciations, and Isan_Medicine_Corpus_word.csv, which contains the tokenized sentences with POS tags. The dataset has the potential to improve machine translation accuracy for the Isan language and to make the local wisdom recorded in ancient Isan medicine more accessible, particularly by supporting more contextually relevant translations and a better understanding of cultural nuances in medical terminology.
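As a rough illustration of how the two files described above might be inspected after download, the following is a minimal sketch using only the Python standard library. The description does not state the column names, the text encoding, or whether the files have a header row, so the sketch assumes UTF-8 files with a header row and discovers the column names at run time instead of hard-coding them. The helper names `load_rows` and `tag_counts` are hypothetical and are not part of the dataset.

```python
import csv
from collections import Counter
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by its header row.

    Assumption: the file is UTF-8 and its first row is a header.
    """
    # utf-8-sig also accepts files that start with a byte-order mark.
    with Path(path).open(newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def tag_counts(rows, tag_column):
    """Count how often each non-empty value occurs in one column."""
    return Counter(row[tag_column] for row in rows if row.get(tag_column))


if __name__ == "__main__":
    # File names are taken from the dataset description; the files are
    # only read if they are present in the working directory.
    for name in ("Isan_Medicine_Corpus_sentence.csv",
                 "Isan_Medicine_Corpus_word.csv"):
        if Path(name).exists():
            rows = load_rows(name)
            columns = list(rows[0]) if rows else []
            print(f"{name}: {len(rows)} rows, columns: {columns}")
```

Once the real header of the word-level file is known, `tag_counts(rows, "<POS column name>")` could be used to check the tag inventory against the 15 POS tags mentioned in the description.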
Created:
2026-03-09