Aligned Latin-Myanmar Transliteration Dataset

Indexed by NIAID Data Ecosystem on 2026-03-14

Download link:

https://zenodo.org/record/7344772

Description:

Aligned Latin-Myanmar Transliteration Dataset
Chenchen Ding
Tue Nov 22 00:00:00 JST 2022

* Introduction

This data set is a further refined and annotated version of the data at
https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/western-myanmar-transliteration.zip

The data set was developed by Chenchen Ding of NICT.

The license is the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License
https://creativecommons.org/licenses/by-nc-sa/4.0/

* Contents

- data.txt : 42,736 segmented and aligned instances.

* Format

Each line contains a segmented transliteration pair in the format

[Latin segment 1] | [Latin segment 2] | ... ||| [Myanmar segment 1] | [Myanmar segment 2] | ...

where the Latin side and the Myanmar side have the same number of segments.

* Annotation Guidelines

- Nothing is inserted on the Latin side; it is only segmented.
- A placeholder @ is inserted on the Myanmar side for unaligned Latin segments.
- Consonant clusters at a syllable onset are generally segmented and aligned to Myanmar basic letters.
- Consonants at a coda are generally aligned to the placeholder, unless they are absorbed by a rhyme with nasalization or a glottal stop, or by an extra explicit killed letter.
- Doubled consonant letters are generally segmented and treated as the coda and the onset of two neighboring syllables.
- Myanmar rhymes are generally not segmented.
- The Myanmar letter A (0x1021) is left unsegmented in vowel-initial words, because nothing is inserted on the Latin side.

The data can be used directly to train a sequence-labeling model for Myanmar Romanization.

* Disclaimer

[1] NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.
[2] If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at alt-info [at] khn [dot] nict [dot] go [dot] jp. We will review the issue and undertake appropriate measures when needed.
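As an illustration of the line format and the sequence-labeling use described above, here is a minimal Python sketch. It is not part of the data set's distribution. It assumes that the two sides are separated by `|||`, that segments are separated by `|` with optional surrounding spaces, and that empty segments can be ignored. The function names `parse_line` and `to_char_labels` and the `B:`/`I` tagging scheme are hypothetical choices made for this example, and the sample line uses made-up tokens (`M1`, `M2`) rather than real entries from data.txt.

```python
from typing import List, Tuple

PLACEHOLDER = "@"  # per the guidelines: marks an unaligned Latin segment


def parse_line(line: str) -> List[Tuple[str, str]]:
    """Split one line into (Latin segment, Myanmar segment) pairs.

    Assumes a single '|||' between the two sides and '|' between segments.
    """
    sides = line.rstrip("\n").split("|||")
    if len(sides) != 2:
        raise ValueError(f"expected exactly one '|||' separator: {line!r}")
    # Strip whitespace and drop empty pieces (e.g. from a trailing '|').
    latin = [s.strip() for s in sides[0].split("|") if s.strip()]
    myanmar = [s.strip() for s in sides[1].split("|") if s.strip()]
    if len(latin) != len(myanmar):
        raise ValueError(
            f"segment count mismatch ({len(latin)} vs {len(myanmar)}): {line!r}"
        )
    return list(zip(latin, myanmar))


def to_char_labels(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn aligned segments into per-character labels (hypothetical scheme).

    The first character of each Latin segment is tagged 'B:<Myanmar segment>',
    and every following character of that segment is tagged 'I'.
    """
    labeled = []
    for latin_seg, myanmar_seg in pairs:
        for i, ch in enumerate(latin_seg):
            labeled.append((ch, f"B:{myanmar_seg}" if i == 0 else "I"))
    return labeled


if __name__ == "__main__":
    # Synthetic example line; M1 and M2 stand in for Myanmar segments.
    sample = "th | a | n ||| M1 | M2 | @"
    pairs = parse_line(sample)
    for char, label in to_char_labels(pairs):
        print(char, label)
```

To process the whole file, one could open data.txt with `encoding="utf-8"` and call `parse_line` on each non-empty line. Whether a character-level or a segment-level tagging scheme suits a given model is a design choice the README leaves open.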

Created:

2022-11-22
