Kothon: A Large-Scale Dataset for Machine Translation of the Chittagonian and Sylheti Dialects into Standard Bangla
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/2fv6vf9v2z
Description:
Chittagonian and Sylheti are two major, complex Bengali dialects spoken by over 24 million people. However, their well-written forms are becoming increasingly rare, putting them at risk of extinction. Because these dialects differ significantly from Standard Bangla, they often create communication barriers for non-dialectal speakers. Despite this, very few research efforts have addressed the issue. Existing resources are limited to small datasets, which are insufficient for effective dialect preservation. To bridge this gap, this study focuses on the creation and evaluation of large-scale parallel corpora for Chittagonian-Bangla and Sylheti-Bangla translation. A total of 8,000 Chittagonian and 9,300 Sylheti sentences were collected and annotated by five native dialect-speaking annotators. Standard Bangla sentences were gathered from open-source resources, novels, and existing datasets, complemented by text scanned from printed books. A custom web-based annotation tool was developed to aid the annotation process. The quality and reliability of the datasets were ensured through a rigorous validation process in which independent native speakers reviewed the translations. This dataset serves as a valuable resource for advancing research in Bengali language processing and for supporting the creation of intelligent systems that help preserve dialects and promote digital communication.
Created:
2025-10-28



