ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language
Source: DataCite Commons | Updated: 2026-03-18 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/gjhk4jmbfy
Description:
The advancement of NLP technologies for low-resource and endangered languages is critically hindered by the scarcity of high-quality parallel corpora. This is particularly true for Chakma, which also faces the challenge of prevalent non-standard, romanized script usage in digital communication. To address this gap, we introduce ChakmaBridge, the first five-way parallel corpus for Chakma, containing 807 sentences aligned across English, Standard Bangla, Bengali-script Chakma, Romanized Bangla, and Romanized Chakma. Our dataset is created by augmenting the MELD corpus with LLM-generated romanizations that are rigorously validated by native speakers. We release ChakmaBridge to facilitate research in low-resource MT and aid in the digital preservation of this endangered language.
Citation: Rahman, Md Abdur, Md Tofael Ahmed Bhuiyan, and Abdul Kadar Muhammad Masum. "ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language." In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pp. 259-265. 2025.
Provider:
Mendeley Data
Created:
2026-03-18



