
PolCSBD: Political Counter Speech BD

Source: DataCite Commons. Updated: 2026-04-27. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/ddvzpjkws7
Description:
PolCSBD was developed to address a gap in natural language processing: detecting political counter-speech in low-resource, code-mixed languages. Its foundational hypothesis is that counter-speech cannot be classified accurately from a single comment in isolation; classification requires the preceding statement as context. A second hypothesis is that social media users in Bangladesh make heavy use of "Banglish" (Bengali vocabulary written phonetically in the English alphabet) alongside native Bengali script, which creates a major barrier for standard text classification models.

The dataset provides over 10,000 contextual pairs of social media text extracted from political discussions. Each row represents one conversational exchange, with a "parent_text" (the initial statement) and a "reply_text" (the direct response). The data reflects the linguistic reality of the region: native Bengali script, fully Romanized Bengali, and hybrid sentences. It captures how internet users employ historical references, aggressive debate tactics, and sarcasm to challenge political narratives.

How to interpret and use the data: The dataset is provided in a machine-learning-ready format, suitable for training, fine-tuning, or benchmarking Transformer models (such as mBERT, XLM-RoBERTa, or BanglaBERT). It contains exactly three columns:

parent_text: The contextual baseline statement, preprocessed to remove noise.
reply_text: The responding statement, preprocessed in the same way.
label: A binary integer. A value of 1 indicates Counter-Speech (the reply actively disputes, corrects, or challenges the parent text with a counter-narrative). A value of 0 indicates Non-Counter Speech (the reply simply agrees, adds unrelated noise, or resorts to isolated insults without addressing the argument).

Because the text has already been normalized (noise removal and lowercasing), practitioners can pass the CSV directly to tokenizers and neural networks without building a separate data-cleaning pipeline.
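The three-column layout described above can be checked before any model training. The following is a minimal sketch using only the Python standard library, not an official loader. It assumes the CSV header uses exactly the column names given in the description. The helper names (`load_pairs`, `script_of`) and the sample rows are hypothetical placeholders, not taken from the dataset. The script check relies on the Unicode Bengali block (U+0980 to U+09FF) to separate native-script, Romanized, and hybrid text.

```python
import csv
import io
from collections import Counter

# Column names as stated in the dataset description.
EXPECTED_COLUMNS = {"parent_text", "reply_text", "label"}


def load_pairs(handle):
    """Read (parent_text, reply_text, label) tuples and validate the schema."""
    reader = csv.DictReader(handle)
    if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    pairs = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {label}")
        pairs.append((row["parent_text"], row["reply_text"], label))
    return pairs


def script_of(text):
    """Classify text as 'bengali', 'latin', 'mixed', or 'other' by letter script."""
    has_bengali = any("\u0980" <= ch <= "\u09ff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_bengali and has_latin:
        return "mixed"
    if has_bengali:
        return "bengali"
    if has_latin:
        return "latin"
    return "other"


# Made-up placeholder rows for illustration only; these are NOT dataset content.
SAMPLE_CSV = (
    "parent_text,reply_text,label\n"
    "placeholder parent one,placeholder reply one,1\n"
    "বাংলা লেখা,bangla lekha,0\n"
    "mixed বাংলা text,ok,0\n"
)

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, _, label in pairs)
reply_scripts = Counter(script_of(reply) for _, reply, _ in pairs)
print(label_counts, reply_scripts)
```

For the real file, the in-memory sample would be replaced by something like `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the downloaded CSV was saved. Keeping both texts of each pair together matters because, per the description, the label depends on the reply in the context of its parent.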
Provider:
Mendeley Data
Created:
2026-02-19