mHossain/Bengali_ParaPhrase_Corpus
Source: Hugging Face. Updated: 2025-06-18. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/mHossain/Bengali_ParaPhrase_Corpus
Description:
BanglaParaCorpus is a large-scale Bengali paraphrase dataset containing 2 million sentence pairs. It is designed to support paraphrase generation, recognition, and evaluation for Bengali, a low-resource language. The dataset was built with a multilingual pivoting strategy (Bengali to English to Bengali), followed by a comprehensive filtering pipeline intended to ensure semantic consistency, fluency, and diversity. It covers both formal text (e.g., news articles) and informal text (e.g., social media), making it suitable for a wide range of downstream NLP applications.
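
The pivot-then-filter idea described above can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function names `token_overlap` and `pivot_paraphrases`, the `to_pivot` and `from_pivot` translator callables, and the `max_overlap` threshold are all hypothetical, and the only filter shown is a crude lexical-diversity check. Semantic-consistency and fluency filters would need additional models and are left out here.

```python
from typing import Callable, Iterable, Iterator, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (a crude stand-in for a diversity score)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pivot_paraphrases(
    sentences: Iterable[str],
    to_pivot: Callable[[str], str],    # assumed: Bengali -> English translator
    from_pivot: Callable[[str], str],  # assumed: English -> Bengali translator
    max_overlap: float = 0.8,          # assumed threshold, not taken from the source
) -> Iterator[Tuple[str, str]]:
    """Yield (source, paraphrase) pairs produced by round-trip translation."""
    for source in sentences:
        # Round trip: source language -> pivot language -> source language.
        candidate = from_pivot(to_pivot(source)).strip()
        # Drop empty outputs and exact copies of the input.
        if not candidate or candidate == source.strip():
            continue
        # Drop candidates that are lexically too close to the source.
        if token_overlap(source, candidate) > max_overlap:
            continue
        yield source, candidate
```

Any pair of translation functions can be passed in as `to_pivot` and `from_pivot`; keeping them as plain callables means the filtering logic can be tried with toy stand-ins before wiring in a real translation model.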
Provider:
mHossain



