
Saptak: A Large-scale Multi-Regional Benchmark Dataset for Poly-Dialectal Neural Machine Translation between Standard Bangla and Regional Dialects, and among Regional Dialects

Source: DataCite Commons | Updated: 2026-04-15 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/v9cf66fk2t
Description:
This dataset is a parallel corpus developed for poly-dialectal neural machine translation (NMT), aimed at bridging communication between Standard Bangla and its regional dialects. As of April 4, 2026, it is the largest multi-dialect Bengali translation dataset, covering eight language varieties: Standard Bengali and seven regional dialects (Sylheti, Chittagonian, Barisal, Noakhali, Rangpur, Rajshahi, and Mymensingh).

The dataset contains 11,056 parallel sentences for Standard Bengali, 10,567 for Chittagonian, 6,422 for Sylheti, 5,712 for Mymensingh, 5,000 for Noakhali, 4,270 for Barisal, 891 for Rajshahi, and 655 for Rangpur. This significantly expands the resources available for regional Bangla translation and particularly strengthens the representation of Chittagonian and Standard Bengali.

Table: Comparative distribution of sentence pairs across Standard Bangla and seven regional dialects in existing Bengali NLP datasets and the proposed Saptak dataset.
Column key: Std = Standard Bengali, Syl = Sylheti, Chit = Chittagonian, Bar = Barisal, Noa = Noakhali, Ran = Rangpur, Raj = Rajshahi, Mym = Mymensingh.

Dataset            Std    Syl   Chit    Bar    Noa    Ran    Raj    Mym
Ancholik-NER      3481   3481   3481   3481   3481      0      0   3481
Anubhuti          2500   2500   2500      0      0      0      0   2500
Dialect BD        3452    442    577    790      0    655    891    712
Bhasabodh          980    980    980      0      0      0      0      0
Chatgaiya Alap    4011      0   4011      0      0      0      0      0
Onubad             980    980    980    980      0      0      0      0
Vhasantor         2500   2500   2500      0   2500      0      0   2500
Saptak (Ours)    11056   6422  10567   4270   5000    655    891   5712
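The per-variety counts quoted in the description can be tallied to see how unevenly the varieties are represented, which matters when sampling training batches for NMT. The following is a minimal sketch, not part of the dataset's own tooling: the counts are copied from the description, while the names `SAPTAK_COUNTS` and `summarize` are hypothetical. The description does not state whether sentences are aligned across all varieties, so the total is only a sum of the listed figures.

```python
# Sentence counts per language variety, copied from the dataset description.
SAPTAK_COUNTS = {
    "Standard Bengali": 11056,
    "Chittagonian": 10567,
    "Sylheti": 6422,
    "Mymensingh": 5712,
    "Noakhali": 5000,
    "Barisal": 4270,
    "Rajshahi": 891,
    "Rangpur": 655,
}


def summarize(counts):
    """Return (total, share per variety, largest-to-smallest ratio)."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    imbalance = max(counts.values()) / min(counts.values())
    return total, shares, imbalance


total, shares, imbalance = summarize(SAPTAK_COUNTS)

print(f"total listed sentences: {total}")
# List varieties from most to least represented.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18}{SAPTAK_COUNTS[name]:>7}  {share:6.1%}")
print(f"largest / smallest: {imbalance:.1f}x")
```

On these figures, Standard Bengali has roughly 17 times as many sentences as Rangpur, so a training setup may want to upsample or reweight the two smallest varieties; that choice is left to the user.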
Provider:
Mendeley Data
Created:
2026-04-15