Bengali-English parallel corpus

Name: Bengali-English parallel corpus
Creator: Bangladesh University of Engineering and Technology
Published: 2020-10-07 13:33:13
License: Not specified

Source: arXiv. Updated: 2020-10-07. Indexed: 2024-06-21.

Download link:

https://github.com/csebuetnlp/banglanmt


Description:

The Bengali-English parallel corpus was developed at Bangladesh University of Engineering and Technology and contains 2.75 million high-quality sentence pairs. It was built to address the shortage of machine translation resources for Bengali. The corpus was created with a custom Bengali sentence segmenter, together with aligner ensembling and batch filtering, and it spans multiple domains. Training on it is reported to significantly improve Bengali-English machine translation quality. Its main use is as large-scale, high-quality training data for low-resource machine translation, Bengali-English in particular.
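
The description names aligner ensembling and batch filtering without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that ensembling means taking the union of candidate sentence pairs from several aligners, and that filtering means scoring pairs batch by batch and keeping those above a threshold. The names `ensemble_pairs`, `filter_in_batches`, and `length_ratio_scores` are hypothetical, and the length-ratio scorer is a toy stand-in for whatever similarity measure a real pipeline would use.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (Bengali sentence, English sentence)


def ensemble_pairs(aligner_outputs: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Union the candidate pairs from several aligners, keeping first-seen order."""
    seen: Set[Pair] = set()
    merged: List[Pair] = []
    for output in aligner_outputs:
        for pair in output:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


def filter_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], List[float]],
    threshold: float,
    batch_size: int = 2,
) -> List[Pair]:
    """Score pairs one batch at a time and keep those scoring at or above threshold."""
    kept: List[Pair] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        scores = score_batch(batch)
        kept.extend(p for p, s in zip(batch, scores) if s >= threshold)
    return kept


def length_ratio_scores(batch: Sequence[Pair]) -> List[float]:
    """Toy scorer (assumption): shorter/longer token count, 1.0 means equal length."""
    scores: List[float] = []
    for bn, en in batch:
        a, b = len(bn.split()), len(en.split())
        scores.append(min(a, b) / max(a, b) if max(a, b) else 0.0)
    return scores


if __name__ == "__main__":
    p1 = ("আমি ভাত খাই", "I eat rice")
    p2 = ("সে যায়", "He goes")
    p3 = ("হ্যাঁ", "Yes this is a very long unrelated sentence")
    candidates = ensemble_pairs([[p1, p2], [p2, p3]])  # two hypothetical aligners
    for pair in filter_in_batches(candidates, length_ratio_scores, threshold=0.5):
        print(pair)
```

Scoring in batches matters mainly when the scorer is expensive, for example a neural sentence encoder, because it lets the pipeline amortize model calls over many pairs; with the toy scorer here the batching is purely illustrative.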

Provider:

Bangladesh University of Engineering and Technology

Created:

2020-09-20
