ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/gjhk4jmbfy
Description:
The advancement of NLP technologies for low-resource and endangered languages is critically hindered by the scarcity of high-quality parallel corpora. This is particularly true for Chakma, which also faces the challenge of widespread non-standard, romanized script usage in digital communication. To address this, we introduce ChakmaBridge, the first five-way parallel corpus for Chakma, containing 807 sentences aligned across English, Standard Bangla, Bengali-script Chakma, Romanized Bangla, and Romanized Chakma. Our dataset is created by augmenting the MELD corpus with LLM-generated romanizations that are rigorously validated by native speakers. We release ChakmaBridge to facilitate research in low-resource MT and to aid in the digital preservation of this endangered language.

Citation: Rahman, Md Abdur, Md Tofael Ahmed Bhuiyan, and Abdul Kadar Muhammad Masum. "ChakmaBridge: A Five-Way Parallel Corpus for Navigating the Script Divide in an Endangered Language." In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pp. 259-265. 2025.
Created:
2026-03-18