Saptak: A Large-scale Multi-Regional Benchmark Dataset for Poly-Dialectal Neural Machine Translation between Standard Bangla and Regional Dialects, and among Regional Dialects
DataCite Commons | Updated: 2026-04-15 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/v9cf66fk2t
Description:
This dataset is a parallel corpus for poly-dialectal neural machine translation (NMT), intended to bridge communication between Standard Bangla and its regional dialects. As of April 4, 2026, it is the largest multi-dialect Bangla translation dataset, covering eight language varieties: Standard Bangla and seven regional dialects (Sylheti, Chittagonian, Barisal, Noakhali, Rangpur, Rajshahi, and Mymensingh). It contains 11,056 parallel sentences for Standard Bangla, 10,567 for Chittagonian, 6,422 for Sylheti, 5,712 for Mymensingh, 5,000 for Noakhali, 4,270 for Barisal, 891 for Rajshahi, and 655 for Rangpur. This expands the resources available for regional Bangla translation, most notably for Chittagonian and Standard Bangla.
==================================================
Table: Comparative distribution of sentence pairs across Standard Bangla and seven regional dialects in existing datasets and the proposed Saptak dataset
Dialect Coverage Comparison Across Bengali NLP Datasets

Dataset        | Standard | Sylheti | Chittagonian | Barisal | Noakhali | Rangpur | Rajshahi | Mymensingh
Ancholik-NER   |     3481 |    3481 |         3481 |    3481 |     3481 |       0 |        0 |       3481
Anubhuti       |     2500 |    2500 |         2500 |       0 |        0 |       0 |        0 |       2500
Dialect BD     |     3452 |     442 |          577 |     790 |        0 |     655 |      891 |        712
Bhasabodh      |      980 |     980 |          980 |       0 |        0 |       0 |        0 |          0
Chatgaiya Alap |     4011 |       0 |         4011 |       0 |        0 |       0 |        0 |          0
Onubad         |      980 |     980 |          980 |     980 |        0 |       0 |        0 |          0
Vhasantor      |     2500 |    2500 |         2500 |       0 |     2500 |       0 |        0 |       2500
Saptak (Ours)  |    11056 |    6422 |        10567 |    4270 |     5000 |     655 |      891 |       5712
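
The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset release: it copies the counts from the table into a Python dictionary and computes, for each dataset, the summed count and the number of varieties covered. The names `COUNTS`, `total_sentences`, `varieties_covered`, and `matches_or_exceeds_all` are hypothetical helpers introduced for illustration, and treating the sum of per-variety counts as a rough size measure is an assumption made here, not a figure reported by the authors.

```python
# Per-variety counts copied from the comparison table.
# Column order: Standard, Sylheti, Chittagonian, Barisal,
#               Noakhali, Rangpur, Rajshahi, Mymensingh.
VARIETIES = ["Std", "Syl", "Chit", "Bar", "Noa", "Ran", "Raj", "Mym"]

COUNTS = {
    "Ancholik-NER":   [3481, 3481, 3481, 3481, 3481, 0, 0, 3481],
    "Anubhuti":       [2500, 2500, 2500, 0, 0, 0, 0, 2500],
    "Dialect BD":     [3452, 442, 577, 790, 0, 655, 891, 712],
    "Bhasabodh":      [980, 980, 980, 0, 0, 0, 0, 0],
    "Chatgaiya Alap": [4011, 0, 4011, 0, 0, 0, 0, 0],
    "Onubad":         [980, 980, 980, 980, 0, 0, 0, 0],
    "Vhasantor":      [2500, 2500, 2500, 0, 2500, 0, 0, 2500],
    "Saptak":         [11056, 6422, 10567, 4270, 5000, 655, 891, 5712],
}


def total_sentences(name: str) -> int:
    """Sum the per-variety counts for one dataset (a rough size measure)."""
    return sum(COUNTS[name])


def varieties_covered(name: str) -> int:
    """Count how many of the eight varieties have at least one sentence."""
    return sum(1 for count in COUNTS[name] if count > 0)


def matches_or_exceeds_all(name: str) -> bool:
    """True if the dataset has at least as many sentences as every other
    dataset in every variety column."""
    others = [row for other, row in COUNTS.items() if other != name]
    return all(
        COUNTS[name][i] >= max(row[i] for row in others)
        for i in range(len(VARIETIES))
    )


if __name__ == "__main__":
    for name in COUNTS:
        print(f"{name:15s} total={total_sentences(name):6d} "
              f"varieties={varieties_covered(name)}")
    print("Saptak >= all others per column:", matches_or_exceeds_all("Saptak"))
```

Under these assumptions, Saptak is the only listed dataset with a non-zero count in all eight columns, and its Rangpur and Rajshahi counts equal those of Dialect BD rather than exceeding them.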
Provider:
Mendeley Data
Created:
2026-04-15



