A Dataset of Linguistic and Cognitive Features for Analyzing Political Polarization and Manipulation in Discourse (2000-2025)

Name: A Dataset of Linguistic and Cognitive Features for Analyzing Political Polarization and Manipulation in Discourse (2000-2025)
Creator: Science Data Bank
Published: 2025-12-02 10:13:48
License: No description available

DataCite Commons. Updated: 2025-12-02. Indexed: 2026-05-05.

Download link:

https://www.scidb.cn/detail?dataSetId=a678d2d4689547ac9e7774ec89553ba5

Description:

This dataset provides a structured corpus of linguistic and cognitive features extracted from political discourse, designed to support quantitative and qualitative analysis of polarization and manipulative communication strategies. The data were compiled through a systematic process of text collection, annotation, and feature extraction covering the period 2000–2025. Source materials include political speeches, official statements, media transcripts, and social media content from a range of geopolitical contexts, with an emphasis on English-language sources from North American and European political ecosystems.
Data processing involved part-of-speech tagging, metaphor identification, frame analysis, and sentiment scoring using natural language processing tools such as spaCy and the VADER sentiment analyzer. In addition, trained coders manually labeled instances of cognitive bias triggers (such as us-vs-them dichotomies, catastrophic framing, and appeals to authority) based on the theoretical framework of the “Udanian Elephant Effect” model.
The dataset is organized as a single primary tabular file in CSV format containing over 50,000 entries. Each row represents a unique text segment (e.g., a sentence or short paragraph), with columns capturing features such as metaphor type, polarity score, bias category, source type, date, and country of origin. Missing data are minimal and occur primarily in metadata fields such as speaker affiliation or geographic location; these values are explicitly marked as “NA” to preserve dataset integrity. The creators report no significant measurement errors; inter-coder reliability for the manual annotations averaged 0.84 (Cohen’s kappa), indicating high but imperfect agreement. The file uses a standardized, non-proprietary CSV format to ensure broad accessibility and compatibility with common analytical software such as R, Python, and Excel. File size is approximately 85 MB.
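
As a rough illustration of the CSV layout described above, the sketch below reads rows and maps the “NA” marker to a missing value. The column names and the three sample rows are made up for the example; the actual header of the published file is not stated in this description and may differ.

```python
import csv
import io
from collections import Counter

NA_MARKER = "NA"  # the description says missing values are marked "NA"

# Made-up rows for illustration only; column names are assumptions,
# not the dataset's documented header.
SAMPLE = """segment_id,text,metaphor_type,polarity_score,bias_category,source_type,date,country,speaker_affiliation
1,"They want to destroy everything we built.",war,-0.62,us_vs_them,speech,2016-09-12,US,NA
2,"Experts agree the plan is sound.",none,0.44,appeal_to_authority,media_transcript,2021-03-04,UK,Government
3,"The committee meets on Tuesday.",none,0.0,none,official_statement,2010-01-19,NA,NA
"""


def read_segments(handle):
    """Parse CSV rows, mapping the "NA" marker to None and scores to float."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {key: (None if value == NA_MARKER else value)
               for key, value in raw.items()}
        if row.get("polarity_score") is not None:
            row["polarity_score"] = float(row["polarity_score"])
        rows.append(row)
    return rows


def count_missing(rows):
    """Count missing (None) values per column."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1
    return missing


rows = read_segments(io.StringIO(SAMPLE))
print(count_missing(rows))
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV. If pandas is used instead, `read_csv` treats the string `NA` as missing by default, so no extra configuration should be needed.
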
This resource is intended for researchers in political communication, discourse analysis, and cognitive linguistics seeking to examine how language and cognitive mechanisms interact in the construction of polarized public discourse.
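
For readers unfamiliar with the reliability statistic quoted above, the following is a minimal sketch of Cohen’s kappa for two coders, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The label values and the two toy annotation lists are invented for illustration; the dataset description does not say how the reported average of 0.84 was aggregated across coders or categories.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical bias-category labels, not from the dataset).
coder_a = ["us_vs_them", "catastrophic", "authority",
           "us_vs_them", "catastrophic", "authority"]
coder_b = ["us_vs_them", "catastrophic", "authority",
           "catastrophic", "catastrophic", "authority"]
print(round(cohens_kappa(coder_a, coder_b), 2))
```
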

Provider:

Science Data Bank

Created:

2025-12-02
