PolCSBD: Political Counter Speech BD
Source: DataCite Commons. Updated: 2026-04-27. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/ddvzpjkws7
Resource description:
This dataset (PolCSBD) was developed to address a gap in natural language processing: the detection of political counter-speech in low-resource, code-mixed languages. Our foundational hypothesis was that counter-speech cannot be accurately classified from a single comment in isolation; classification requires the context of the preceding statement. We further hypothesized that social media users in Bangladesh rely heavily on "Banglish" (Bengali vocabulary written phonetically in the English alphabet) alongside native Bengali script, and that this mix is a major barrier for standard text classification models.
The dataset provides over 10,000 contextual pairs of social media text extracted from political discussions. Each row is structured as a direct conversation, containing a "parent_text" (the initial statement) and a "reply_text" (the direct response). The data demonstrates the complex linguistic reality of the region, featuring native Bengali script, fully Romanized Bengali, and hybrid sentences. It effectively captures how internet users employ historical references, aggressive debate tactics, and sarcasm to challenge political narratives.
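The three writing styles mentioned above can be told apart roughly by checking which Unicode ranges a text uses. The following is a minimal sketch of such a check, not the authors' method: the function name `script_profile` and its category names are hypothetical, and the heuristic cannot distinguish Romanized Bengali from plain English, since both use Latin letters.

```python
def script_profile(text: str) -> str:
    """Classify a text by the scripts its characters belong to (rough heuristic)."""
    # The Unicode Bengali block spans U+0980 through U+09FF.
    has_bengali = any("\u0980" <= ch <= "\u09ff" for ch in text)
    # ASCII letters cover Romanized Bengali ("Banglish") as well as English.
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_bengali and has_latin:
        return "hybrid"
    if has_bengali:
        return "bengali_script"
    if has_latin:
        return "latin_script"
    return "other"


# Illustrative inputs only; these strings are not taken from the dataset.
profiles = [script_profile(s) for s in ["বাংলা", "bangla", "বাংলা text", "123"]]
print(profiles)
```

A profile like this could be used to report how the rows are distributed across scripts before training, or to evaluate a model separately on each group.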
How to interpret and use the data:
This dataset is provided in a machine-learning-ready format, which makes it suitable for researchers who want to train, fine-tune, or benchmark Transformer models (such as mBERT, XLM-RoBERTa, or BanglaBERT).
It contains exactly three columns:
parent_text: The contextual baseline statement, which has been preprocessed to remove noise.
reply_text: The responding statement, similarly preprocessed.
label: A binary integer classification. A value of '1' indicates Counter-Speech (the reply actively disputes, corrects, or challenges the parent text with a counter-narrative). A value of '0' indicates Non-Counter Speech (the reply simply agrees, adds unrelated noise, or resorts to isolated insults without addressing the argument).
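The three-column layout described above can be checked when the file is loaded. The sketch below uses only the Python standard library and assumes that the CSV header uses the column names exactly as listed; the helper `load_pairs` is hypothetical, and the sample rows are placeholders rather than real dataset content.

```python
import csv
import io

EXPECTED_COLUMNS = {"parent_text", "reply_text", "label"}


def load_pairs(handle):
    """Read (parent_text, reply_text, label) tuples from an open CSV handle."""
    reader = csv.DictReader(handle)
    # Assumption: the header row matches the documented column names.
    if reader.fieldnames is None or set(reader.fieldnames) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row_no, row in enumerate(reader, start=1):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"data row {row_no}: label must be 0 or 1, got {label}")
        rows.append((row["parent_text"], row["reply_text"], label))
    return rows


# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "parent_text,reply_text,label\n"
    "placeholder parent a,placeholder reply a,1\n"
    "placeholder parent b,placeholder reply b,0\n"
)

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
counter_speech = sum(label for _, _, label in pairs)
print(f"{len(pairs)} pairs, {counter_speech} labeled counter-speech")
```

For the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.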
Because the text has already undergone strict normalization (noise removal and lowercasing), practitioners can pass the parent and reply columns of this CSV directly to a tokenizer without building a data-cleaning pipeline from scratch.
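Since each label depends on both the parent and the reply, the two texts are typically encoded together as a sentence pair. The sketch below shows one way to group rows into batches for that purpose; the helper `iter_pair_batches` is hypothetical, the example rows are placeholders, and the choice of batching scheme is an assumption rather than something the dataset prescribes.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str, int]  # (parent_text, reply_text, label)
Batch = Tuple[List[str], List[str], List[int]]


def iter_pair_batches(pairs: Sequence[Pair], batch_size: int) -> Iterator[Batch]:
    """Yield parallel lists of parents, replies, and labels, batch by batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        parents = [parent for parent, _, _ in chunk]
        replies = [reply for _, reply, _ in chunk]
        labels = [label for _, _, label in chunk]
        yield parents, replies, labels


# Placeholder rows for illustration only; not taken from the dataset.
example_pairs = [
    ("placeholder parent a", "placeholder reply a", 1),
    ("placeholder parent b", "placeholder reply b", 0),
    ("placeholder parent c", "placeholder reply c", 1),
]

batches = list(iter_pair_batches(example_pairs, batch_size=2))
print(len(batches), "batches")
```

Assuming the Hugging Face `transformers` library is used, each batch could then be encoded with a call such as `tokenizer(parents, replies, truncation=True, padding=True)`, where `tokenizer` comes from `AutoTokenizer.from_pretrained` for the chosen checkpoint. Passing the reply as the second argument encodes the two texts as a single paired input, so the model sees the parent statement as context.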
Provider:
Mendeley Data
Created:
2026-02-19



