BanglaParaphrase

Name: BanglaParaphrase
Creator: Bangladesh University of Engineering and Technology
Published: 2022-10-11 10:52:31
License: No description available

arXiv · Updated 2022-10-11 · Indexed 2024-06-21

Download link:

https://github.com/csebuetnlp/banglaparaphrase


Resource description:

BanglaParaphrase is a high-quality Bengali paraphrase dataset created by Bangladesh University of Engineering and Technology. It contains 466,630 entries and is intended to address the low-resource status of Bengali in natural language processing (NLP). The source sentences were collected by crawling the RoarBangla website and then processed with translation models. Paraphrases were generated with the pivoting method, in which a sentence is translated into a pivot language and back, and an additional filtering stage was applied so that each retained pair preserves the meaning of its source while differing from it in wording. The dataset is mainly used to augment other Bengali datasets and to improve performance on natural language understanding tasks such as question answering, style transfer, and semantic parsing.
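The filtering stage described above can be illustrated with a minimal sketch. It assumes that diversity is measured by n-gram novelty (a PINC-style score) and that semantic similarity comes from a scorer supplied by the caller. The function names, the whitespace tokenization, and both thresholds are hypothetical choices for illustration; this listing does not state which metrics or cutoffs the dataset authors actually used.

```python
from typing import Callable, Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_score(source: str, candidate: str, max_n: int = 4) -> float:
    """PINC-style score: mean fraction of candidate n-grams absent from the source.

    0.0 means the candidate shares all of its n-grams with the source;
    1.0 means it shares none. Whitespace tokenization is a simplification.
    """
    src, cand = source.split(), candidate.split()
    fractions = []
    for n in range(1, max_n + 1):
        cand_grams = ngrams(cand, n)
        if not cand_grams:
            continue  # candidate has fewer than n tokens
        overlap = len(cand_grams & ngrams(src, n))
        fractions.append(1.0 - overlap / len(cand_grams))
    return sum(fractions) / len(fractions) if fractions else 0.0


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    similarity: Callable[[str, str], float],
    min_similarity: float = 0.9,  # hypothetical threshold
    min_novelty: float = 0.5,     # hypothetical threshold
) -> List[Tuple[str, str]]:
    """Keep pairs that are close in meaning but different in wording."""
    kept = []
    for source, candidate in pairs:
        if similarity(source, candidate) < min_similarity:
            continue  # meaning drifted too far from the source
        if novelty_score(source, candidate) < min_novelty:
            continue  # candidate is too close to a copy of the source
        kept.append((source, candidate))
    return kept
```

In practice the `similarity` argument would wrap an embedding-based or model-based scorer. A sentence copied verbatim is rejected by the novelty check even when its similarity is perfect, which is the point of filtering on both criteria.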

Provider:

Bangladesh University of Engineering and Technology

Created:

2022-10-11
