Bengali datasets for hate speech, sentiment analysis, and document classification

Name: Bengali datasets for hate speech, sentiment analysis, and document classification
Creator: Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Ireland
Published: 2020-04-20 01:21:30
License: Not specified

Source: arXiv. Updated: 2020-04-20. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/2004.07807v2

Description:

This study builds three datasets for Bengali, an under-resourced language, covering three natural language processing tasks: hate speech detection, sentiment analysis, and document classification. The texts were collected from Bengali Wikipedia, news articles (e.g., Daily Prothom Alo, Daily Jugantor), news dumps of TV channels, books, blogs, and social media (Twitter, Facebook pages and groups, LinkedIn), 250 million articles in total. After text collection, the construction pipeline applies preprocessing (removal of HTML markup, links, image captions, digits, special characters, hashtags, and extra whitespace), followed by part-of-speech tagging, noun replacement, hashtag normalization, stemming, stopword removal, and removal of low-frequency words. The datasets are intended to ease the shortage of Bengali resources for NLP tasks, particularly for deep learning models.
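The rule-based part of the preprocessing described above can be sketched in Python. This is a minimal illustration and not the authors' code: the function names (`clean_text`, `tokenize`, `drop_rare`), the regular expressions, the three-word stopword set, and the `min_count` threshold are all assumptions. Hashtag handling is simplified to stripping the `#` marker, and the steps that need Bengali-specific tools (part-of-speech tagging, noun replacement, stemming) are left out.

```python
import re
from collections import Counter

# All patterns below are illustrative assumptions, not taken from the paper.
HTML_TAG = re.compile(r"<[^>]+>")                  # HTML markup
URL = re.compile(r"https?://\S+|www\.\S+")         # links
HASHTAG_MARK = re.compile(r"#(?=\S)")              # "#" in front of a hashtag word
DIGITS = re.compile(r"[0-9\u09E6-\u09EF]+")        # ASCII and Bengali digits
SPECIAL = re.compile(r"[^\u0980-\u09FFA-Za-z\s]")  # anything but Bengali/Latin letters
SPACES = re.compile(r"\s+")                        # runs of whitespace

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"এবং", "ও", "এই"}


def clean_text(text):
    """Strip markup, links, digits, and special characters from one document."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = HASHTAG_MARK.sub("", text)   # keep the hashtag word, drop the marker
    text = DIGITS.sub(" ", text)
    text = SPECIAL.sub(" ", text)
    return SPACES.sub(" ", text).strip()


def tokenize(text):
    """Clean a document, split on whitespace, and drop stopwords."""
    return [tok for tok in clean_text(text).split() if tok not in STOPWORDS]


def drop_rare(docs, min_count=2):
    """Remove tokens that occur fewer than min_count times across all documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]
```

A typical call would be `drop_rare([tokenize(d) for d in raw_documents])`, where `raw_documents` is a hypothetical list of strings. The character class uses the explicit Bengali Unicode range rather than `\w`, because Python's `\w` does not match Bengali vowel signs and would split words apart.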

Provider:

Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Ireland

Created:

2020-04-12
