Bengali-English parallel corpus

Source: arXiv; updated 2020-10-07; indexed 2024-06-21
Download link:
https://github.com/csebuetnlp/banglanmt
Resource overview:
The Bengali-English parallel corpus was developed at Bangladesh University of Engineering and Technology and contains 2.75 million high-quality sentence pairs. It was built to address the scarcity of machine translation resources for Bengali. The corpus was created with a custom Bengali sentence segmenter together with novel aligner ensembling and batch filtering methods, and it spans multiple domains. Training on the corpus is reported to significantly improve the quality of Bengali-English machine translation. Its main application is improving machine translation for low-resource languages, Bengali-English in particular, by providing large-scale, high-quality training data.
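As a rough illustration of how sentence pairs from a parallel corpus like this are commonly prepared before training, the sketch below pairs line-aligned Bengali and English sentences and drops obviously mismatched pairs. The function names, the sample sentences, and the `max_ratio` threshold are all assumptions made for illustration. The length-ratio check is a generic heuristic, not the aligner ensembling or batch filtering method used to build this corpus.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def pair_sentences(bn_lines: Iterable[str], en_lines: Iterable[str]) -> Iterator[Pair]:
    """Pair line-aligned sentences, skipping pairs where either side is blank."""
    for bn, en in zip(bn_lines, en_lines):
        bn, en = bn.strip(), en.strip()
        if bn and en:
            yield bn, en


def length_ratio_ok(bn: str, en: str, max_ratio: float = 3.0) -> bool:
    """Generic sanity check (an assumption, not the corpus authors' filter):
    reject pairs whose whitespace-token counts differ by more than max_ratio."""
    bn_len, en_len = len(bn.split()), len(en.split())
    return max(bn_len, en_len) / min(bn_len, en_len) <= max_ratio


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> List[Pair]:
    """Keep only the pairs that pass the length-ratio check."""
    return [(bn, en) for bn, en in pairs if length_ratio_ok(bn, en, max_ratio)]


if __name__ == "__main__":
    # Hypothetical in-memory sample; real data would be read from files.
    bn_sample = ["আমি ভাত খাই।", "", "হ্যাঁ"]
    en_sample = [
        "I eat rice.",
        "Orphan line.",
        "Yes , this is a very long and clearly mismatched English sentence .",
    ]
    for bn, en in filter_pairs(pair_sentences(bn_sample, en_sample)):
        print(bn, "|||", en)
```

In practice the two input iterables would be file handles opened with `encoding="utf-8"`, so that Bengali text is decoded correctly.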
Provider:
Bangladesh University of Engineering and Technology
Created:
2020-09-20