Bengali-English parallel corpus
Source: arXiv | Updated: 2020-10-07 | Indexed: 2024-06-21
Download link:
https://github.com/csebuetnlp/banglanmt
Description:
This Bengali-English parallel corpus was developed at Bangladesh University of Engineering and Technology and contains 2.75 million high-quality sentence pairs. It was built to address the scarcity of machine translation resources for Bengali.
The corpus was created with a custom Bengali sentence segmenter together with aligner ensembling and batch filtering methods. It spans multiple domains and is reported to significantly improve the quality of Bengali-English machine translation.
The dataset is primarily used to improve machine translation for low-resource languages, Bengali-English in particular, by supplying large-scale, high-quality training data for research in this area.
Provider:
Bangladesh University of Engineering and Technology
Created:
2020-09-20



