Bangla2B+

Name: Bangla2B+
Creator: Bangladesh University of Engineering and Technology
Published: 2022-05-10 13:30:12
License: Not specified

Source: arXiv. Updated 2022-05-10; indexed 2024-06-21.

Download link:

https://github.com/csebuetnlp/banglabert


Resource description:

Bangla2B+ is a pre-training dataset for Bangla, a low-resource language, created by a research team at Bangladesh University of Engineering and Technology. It was built by crawling 110 popular Bangla websites and contains approximately 27.5 GB of data spanning encyclopedias, news, blogs, e-books, stories, and social media. The team applied thorough deduplication and filtering during construction to ensure data quality. The dataset is intended mainly to support natural language understanding applications for Bangla, in particular zero-shot cross-lingual transfer learning, and more broadly to advance Bangla research and applications in natural language processing.
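The description mentions deduplication and filtering but gives no details. As a rough illustration only, and not the authors' actual pipeline, the sketch below shows one common approach: exact-duplicate removal by hashing Unicode-normalized text. The function names `normalize` and `deduplicate`, the choice of NFKC normalization, and the empty-document filter are all assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivially different copies compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized text by SHA-256 digest.

    Hypothetical helper for illustration; the real Bangla2B+ pipeline may differ.
    """
    seen = set()
    unique = []
    for doc in documents:
        cleaned = normalize(doc)
        if not cleaned:
            continue  # drop documents that are empty after normalization
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique
```

Storing digests rather than full texts keeps memory use bounded per document, which matters at the scale of a multi-gigabyte crawl. Near-duplicate detection (for example, MinHash) would need a different technique and is not shown here.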

Provider:

Bangladesh University of Engineering and Technology

Created:

2021-01-01
