
Bangla2B+

Source: arXiv | Updated: 2022-05-10 | Indexed: 2024-06-21
Download link:
https://github.com/csebuetnlp/banglabert
Description:
Bangla2B+ is a pre-training dataset for Bangla, a low-resource language, created by a research team at Bangladesh University of Engineering and Technology. It was built by crawling 110 popular Bangla websites, yielding about 27.5 GB of data that spans encyclopedias, news, blogs, e-books, stories, and social media. The team thoroughly deduplicated and filtered the crawled data to ensure quality. The dataset mainly supports natural language understanding applications for Bangla, particularly zero-shot cross-lingual transfer learning, and aims to advance Bangla research and applications in natural language processing.
Provided by:
Bangladesh University of Engineering and Technology
Created:
2021-01-01