
Aligned Latin-Myanmar Transliteration Dataset

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7344772
Description:

Chenchen Ding
Tue Nov 22 00:00:00 JST 2022

* Introduction

This data set is a further refined and annotated version of the data at
https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/western-myanmar-transliteration.zip

The data set is developed by Chenchen Ding from NICT.

The license is the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License
https://creativecommons.org/licenses/by-nc-sa/4.0/

* Contents

- data.txt : 42,736 segmented and aligned instances.

* Format

Each line contains a segmented transliteration pair in the format

[Latin segment 1] | [Latin segment 2] ... ||| [Myanmar segment 1] | [Myanmar segment 2] | ...

where the Latin side and the Myanmar side of a pair have an identical number of segments.

* Annotation Guidelines

- There is no insertion on the Latin side, only segmentation.
- A placeholder @ is inserted on the Myanmar side for unaligned Latin segments.
- Consonant clusters at the syllable onset are generally segmented and aligned to Myanmar basic letters.
- Consonants at the coda are generally aligned to the placeholder, unless they are absorbed by a rhyme with nasalization or a glottal stop, or by an extra explicit killed-letter.
- Doubled consonant letters are generally segmented and treated as the coda and onset of two neighboring syllables.
- Myanmar rhymes are generally not segmented.
- The Myanmar letter A (0x1021) is left unsegmented in vowel-initial words, because nothing is inserted on the Latin side.

The data can be directly used to train a sequence-labeling model for Myanmar Romanization.

* Disclaimer

[1] NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.
[2] If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at alt-info [at] khn [dot] nict [dot] go [dot] jp. We will review the issue and undertake appropriate measures when needed.
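
As a reading aid for the Format section above, here is a minimal Python sketch of how one line of data.txt might be parsed into aligned segment pairs. It is not part of the dataset distribution. The function names `parse_line` and `unaligned_latin` are hypothetical, and the sketch assumes that whitespace around segments is not significant and that lines carry no trailing `|` separator; neither assumption is confirmed by the README. The sample line is invented for illustration and is not taken from data.txt.

```python
from typing import List, Tuple

# Placeholder used on the Myanmar side for unaligned Latin segments (per the README).
PLACEHOLDER = "@"


def parse_line(line: str) -> List[Tuple[str, str]]:
    """Parse one 'Latin ||| Myanmar' line into (latin, myanmar) segment pairs.

    Assumption: surrounding whitespace is not significant and is stripped.
    """
    sides = line.rstrip("\n").split("|||")
    if len(sides) != 2:
        raise ValueError(f"expected exactly one '|||' separator: {line!r}")
    latin = [seg.strip() for seg in sides[0].split("|")]
    myanmar = [seg.strip() for seg in sides[1].split("|")]
    # The README states that both sides have an identical number of segments.
    if len(latin) != len(myanmar):
        raise ValueError(
            f"segment count mismatch ({len(latin)} vs {len(myanmar)}): {line!r}"
        )
    return list(zip(latin, myanmar))


def unaligned_latin(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return the Latin segments that are aligned to the placeholder."""
    return [lat for lat, mya in pairs if mya == PLACEHOLDER]


if __name__ == "__main__":
    # Invented example line, NOT taken from data.txt.
    sample = "k | a | t ||| \u1000 | \u102c | @"
    pairs = parse_line(sample)
    print(pairs)
    print(unaligned_latin(pairs))
```

Raising an error on a segment-count mismatch, rather than silently truncating with `zip`, keeps malformed lines visible before the pairs are used as training labels.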
Created:
2022-11-22