Bangla News Dataset
Source: Mendeley Data, updated 2023-01-27, indexed 2024-06-26
Download link:
https://data.mendeley.com/datasets/xp92jxr8wn
Description:
A corpus of Bangla newspaper articles, collected with a custom web crawler and covering 12 different topics. The dataset contains more than 28.5 million word tokens. The number of unique words is around 3% of the entire vocabulary of the dataset. The dataset is imbalanced. 20% of the dataset was set aside as a held-out set.
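The description mentions token counts, unique words, class imbalance, and a 20% held-out portion. As a rough illustration only, the sketch below shows one way such figures and such a split could be computed in Python. The whitespace tokenization, the per-topic (stratified) split, and the names `corpus_stats` and `stratified_holdout` are all assumptions made here; the dataset page does not say how the authors tokenized the text or drew the held-out set.

```python
import random
from collections import Counter, defaultdict


def corpus_stats(texts):
    """Return (total tokens, unique tokens) using naive whitespace tokenization.

    Assumption: whitespace splitting stands in for whatever tokenizer
    the dataset authors actually used.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    return sum(counts.values()), len(counts)


def stratified_holdout(labels, holdout_frac=0.2, seed=0):
    """Return (train_idx, heldout_idx), holding out a fraction of each topic.

    Assumption: splitting per topic keeps rare topics represented in the
    held-out set; the source does not state that the split was stratified.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train, heldout = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        k = round(len(idx) * holdout_frac)  # held-out count for this topic
        heldout.extend(idx[:k])
        train.extend(idx[k:])
    return sorted(train), sorted(heldout)
```

Calling `stratified_holdout` on the list of topic labels returns two index lists that can be used to select articles; fixing `seed` makes the split repeatable.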
Created:
2019-12-09



