Bangla Document Dataset

Name: Bangla Document Dataset
Creator: Pennsylvania State University
Published: 2023-08-22 06:18:09
License: No description available

Source: arXiv | Updated: 2023-08-22 | Indexed: 2024-06-21

Download link:

https://cutt.ly/SYTV6Pv

Description:

The Bangla Document Dataset is a comprehensive dataset created jointly by researchers from Pennsylvania State University and Charles Darwin University. It contains 212,184 Bangla documents in seven categories: government and politics, science and technology, economy, health and lifestyle, entertainment, art and literature, and sports. The documents were annotated manually to ensure quality. The texts were collected from multiple news portals and blogs with custom Python web crawlers, then deduplicated and passed through content analysis. Intended applications include text classification, feature extraction, and training deep learning models. The dataset aims to address the scarcity of Bangla language data resources and to support research in Bangla natural language processing.
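The description mentions a deduplication step but does not say how it was done. The following is a minimal sketch of one common approach (exact-match deduplication on a hash of normalized text), not the authors' actual pipeline. The record layout (`text` and `category` keys), the function names, and the choice of NFC normalization with SHA-256 are all assumptions made for illustration.

```python
import hashlib
import unicodedata

# The seven category labels named in the dataset description.
CATEGORIES = [
    "government and politics",
    "science and technology",
    "economy",
    "health and lifestyle",
    "entertainment",
    "art and literature",
    "sports",
]


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(docs):
    """Keep the first occurrence of each document, compared by normalized-text hash.

    `docs` is assumed to be an iterable of dicts with a "text" key
    (a hypothetical layout; the real dataset format may differ).
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    sample = [
        {"text": "বাংলা  ভাষা", "category": "art and literature"},
        {"text": "বাংলা ভাষা", "category": "art and literature"},  # same text, different spacing
        {"text": "খেলা", "category": "sports"},
    ]
    print(len(deduplicate(sample)))
```

Hashing only catches exact copies after normalization. Near-duplicates, such as the same article republished with a different headline, would need a fuzzy method like shingling with MinHash.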

Provider:

Pennsylvania State University

Created:

2023-08-22
