A dataset of transliterations and pronunciations of ancient Isan scripts

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/cy4c8dwb7t

Description:

This paper presents the first parallel dataset of transliterations and pronunciations of ancient Isan scripts. The dataset is a resource for natural language processing tasks and for Isan cultural preservation in Thailand. It was constructed from ancient Isan medicine books that were translated from palm-leaf manuscripts of Isan medicinal recipes. The transliteration sentences were transliterated from an ancient Isan manuscript and then rendered into Isan pronunciation. The construction process involved data collection from the Research Institute of Northeastern Arts and Culture, Mahasarakham University, Mahasarakham Province, Thailand, followed by data cleaning and sentence alignment. The dataset consists of sentence pairs, each pairing the transliteration of an ancient Isan script sentence with its corresponding pronunciation, and it includes part-of-speech (POS) annotations. The final dataset contains 4,548 sentence pairs and 214,489 words, annotated with a set of 15 POS tags. The data is stored in CSV format in two files: Isan_Medicine_Corpus_sentence.csv, which contains the parallel transliteration and pronunciation sentences, and Isan_Medicine_Corpus_word.csv, which contains the tokenized sentences with POS tags. The dataset has the potential to improve machine translation accuracy for the Isan language and to make the local wisdom recorded in ancient Isan medicine more accessible, particularly by supporting more contextually relevant translations and a better understanding of cultural nuances in medical terminology.
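As a rough illustration of how the two files described above might be read, the following Python sketch loads sentence pairs and counts POS tags using only the standard library. The column names (`transliteration`, `pronunciation`, `pos`) and the helper names are assumptions made for this example; the description does not state the actual CSV headers, so they should be checked against the downloaded files. A small in-memory sample stands in for the real data.

```python
import csv
import io
from collections import Counter


def load_sentence_pairs(handle, translit_col="transliteration",
                        pron_col="pronunciation"):
    """Read (transliteration, pronunciation) pairs from a CSV file object.

    The default column names are hypothetical; pass the real header names
    once they are known.
    """
    reader = csv.DictReader(handle)
    return [(row[translit_col], row[pron_col]) for row in reader]


def pos_tag_counts(handle, tag_col="pos"):
    """Count how often each POS tag occurs in a word-level CSV file object.

    The default column name is hypothetical.
    """
    reader = csv.DictReader(handle)
    return Counter(row[tag_col] for row in reader)


# Tiny in-memory stand-ins for the two files (placeholder values, not real data).
sample_sentences = io.StringIO(
    "transliteration,pronunciation\n"
    "t1,p1\n"
    "t2,p2\n"
)
sample_words = io.StringIO(
    "word,pos\n"
    "w1,NOUN\n"
    "w2,VERB\n"
    "w3,NOUN\n"
)

pairs = load_sentence_pairs(sample_sentences)
tags = pos_tag_counts(sample_words)
print(len(pairs), "sentence pairs;", len(tags), "distinct POS tags")
```

With the real files, the same functions could be called on `open("Isan_Medicine_Corpus_sentence.csv", newline="", encoding="utf-8")` and `open("Isan_Medicine_Corpus_word.csv", newline="", encoding="utf-8")`, assuming UTF-8 encoding. If the description holds, the first should yield 4,548 pairs and the second 15 distinct tags, which makes a convenient sanity check after download.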

Created:

2026-03-09
