DatarrX/myX-Burmese-Morpho-Synthetic
Source: Hugging Face. Updated: 2026-04-23. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/DatarrX/myX-Burmese-Morpho-Synthetic
Description:
myX-Burmese-Morpho-Synthetic is a high-volume, synthetically augmented dataset of more than 37.8 million rows of Burmese word formations. It was developed by Khant Sint Heinn (Kalix Louis) under the DatarrX organization to advance structural understanding of Burmese in Natural Language Processing (NLP) by supplying grammatically structured word combinations at large scale. Many of the generated strings are semantically meaningless because they are synthetic, but all of them strictly follow the morphological rules of Burmese. This makes the dataset useful for tasks that depend on structural analysis rather than semantic understanding. The data was generated with a rule-based synthetic augmentation approach that takes high-quality root words from open-source dictionaries and combines them with various linguistic markers. Each row has three columns: root (the base word), suffix (the affix or functional particle attached), and combined (the final generated Burmese word or phrase). The dataset is released in JSONL format under the Apache License 2.0.
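Because the rows are published as JSONL with the three columns described above, they can be read one line at a time with the Python standard library. The following is a minimal sketch, not the authors' tooling: the `MorphoRow` type and the `iter_rows` helper are hypothetical names, the JSON keys are assumed to match the column names exactly, and the two sample rows are illustrative and not copied from the dataset.

```python
import io
import json
from typing import Iterator, NamedTuple, TextIO


class MorphoRow(NamedTuple):
    """One dataset row (hypothetical type; fields mirror the three columns)."""
    root: str
    suffix: str
    combined: str


def iter_rows(stream: TextIO) -> Iterator[MorphoRow]:
    """Yield one MorphoRow per non-empty JSONL line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: keys are named exactly "root", "suffix", "combined".
        yield MorphoRow(record["root"], record["suffix"], record["combined"])


# Illustrative rows only; they are NOT taken from the dataset.
SAMPLE = io.StringIO(
    '{"root": "စား", "suffix": "သည်", "combined": "စားသည်"}\n'
    '\n'
    '{"root": "သွား", "suffix": "မည်", "combined": "သွားမည်"}\n'
)

rows = list(iter_rows(SAMPLE))
distinct_suffixes = {row.suffix for row in rows}
```

For the real file, the same generator can be given a handle opened with `encoding="utf-8"`; iterating lazily instead of calling `list` keeps memory use flat across tens of millions of rows.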
Provider:
DatarrX



