ToxLex_bn: A Curated Dataset of Bangla Toxic Language Derived from Facebook Comment
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/9pz8ssmc49
Description:
ToxLex (Lexicon of Toxic Language) is a dataset of aggressive and abusive words used on social media. Specifically, it contains utterances drawn from user-generated Facebook comments. The texts cover the demographic and thematic distribution of toxic language in Bangla on social media. The data were extracted from 8 publicly accessible Facebook pages. The dataset is curated, de-duplicated, and anonymized, and is derived from raw comments. It contains 1959 rows and 8 columns; each row represents a toxic bigram with its corresponding features, such as transcriptions, translation, spelling standards, and degree of toxicity. The dataset was annotated by a single human annotator and curated for use in building classifiers for toxic language detection systems. It can also serve as a wordlist of Bangla cyberbullying, hate speech, and slang.
Warning: this dataset contains text content that may be distressing or upsetting.
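The description above states a shape (1959 rows, 8 columns) and a de-duplication guarantee, both of which can be checked after download. The following is a minimal sketch of such a check, not the authors' tooling. The listing does not give the file name, file format, or column names, so the CSV format, the `bigram` column, and the helper names `summarize` and `matches_description` are all assumptions made for illustration, and the sample rows are made-up placeholders.

```python
import csv
import io
from collections import Counter

# Shape stated in the dataset description.
EXPECTED_ROWS = 1959
EXPECTED_COLS = 8


def summarize(csv_text, key_column):
    """Return (row_count, column_count, duplicated_keys) for CSV text.

    key_column is the column expected to be unique after de-duplication.
    Its real name is not given in the listing, so callers must supply it.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    counts = Counter(row[key_column].strip() for row in rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return len(rows), len(columns), duplicates


def matches_description(n_rows, n_cols):
    """True when the shape agrees with the figures in the description."""
    return n_rows == EXPECTED_ROWS and n_cols == EXPECTED_COLS


# Made-up placeholder sample; "bigram" and "toxicity" are hypothetical names.
SAMPLE = "bigram,toxicity\nw1 w2,high\nw3 w4,low\nw1 w2,high\n"

n_rows, n_cols, dups = summarize(SAMPLE, key_column="bigram")
print(n_rows, n_cols, dups)
print(matches_description(n_rows, n_cols))
```

For the real file, the text passed to `summarize` would come from reading the downloaded file with UTF-8 encoding, and an empty duplicate list together with a `True` result from `matches_description` would be consistent with the description.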
Created:
2022-04-27



