4chan Politically Incorrect Corpus

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14992404

Description:

The published data form the groundwork for the dissertation "Hate Speech in the Digital World: A Linguistic Analysis of the Politically Incorrect 4chan Corpus" and consist of seven files that include raw data, processed datasets, Python scripts, and linguistic analysis outputs. The dataset aims to facilitate further research on hate speech, computational linguistics, and online discourse analysis.

"Chan_data" is a Jupyter Notebook (.ipynb format) containing the Python code used to extract data from the Politically Incorrect board on 4chan. It generates two datasets: 4chan-PIC, the full dataset of extracted posts, and 4chan-HM, a filtered dataset containing only posts with explicit hateful language. The script was developed using Anaconda Navigator and Jupyter Notebook.

"chan_comment_final" is an Excel file containing three sheets. The first sheet, 4chan-PIC, includes the complete dataset of extracted posts. The second sheet, 4chan-HM, contains the filtered dataset with only posts that include explicit hateful language. The third sheet, Explicit 4chan-HM Hate List, lists the explicit hate speech words used to filter the 4chan-PIC dataset into 4chan-HM.

"Cleaningandseparating" is a second Jupyter Notebook, used to convert the scraped dataset into a textual corpus (.txt format) for the linguistic analysis. "cleaned_text" is a plain text file containing the finalised textual corpus used to compile the 4chan Politically Incorrect Corpus on Sketch Engine.

"4chan-PIC Concordancing" is an Excel file containing concordancing data from Sketch Engine, which presents words in context and was used for linguistic analysis. "4chan-PIC Wordlists" is an Excel file with wordlist data from Sketch Engine, providing word and lemma frequency statistics to support quantitative analysis. "4chan-PIC Keyword analysis" is an Excel file with keyword analysis data from Sketch Engine, identifying statistically significant keywords in the 4chan Politically Incorrect Corpus compared to a reference corpus, English Web 21.

These files support research in hate speech detection and analysis, computational linguistics, corpus-based studies, and online discourse analysis on social media. They also contribute to data-driven studies of extremist language trends. The datasets and scripts are provided for academic research and reproducibility purposes.
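
The description says that 4chan-HM is obtained by filtering 4chan-PIC against a list of explicit terms, but it does not show how the notebook does this. The following is a minimal sketch of one way such a filter could work, assuming case-insensitive whole-word matching. The function names, the sample posts, and the placeholder terms are all hypothetical and are not taken from the dataset or its scripts.

```python
import re


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern from a list of terms."""
    # Longest terms first so multi-word entries win over their substrings.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_posts(posts, terms):
    """Return only the posts that contain at least one listed term."""
    if not terms:
        return []  # an empty list matches nothing
    pattern = build_pattern(terms)
    return [post for post in posts if pattern.search(post)]


# Hypothetical data: neutral stand-ins, not real corpus content.
posts = [
    "This thread is about the election.",
    "Typical PLACEHOLDER_A behaviour.",
    "placeholder_b everywhere, as usual.",
    "Nothing to see here.",
]
term_list = ["placeholder_a", "placeholder_b"]

filtered = filter_posts(posts, term_list)
```

Whole-word matching is only one possible design choice: it avoids flagging posts where a listed term appears inside a longer, unrelated word, but it also misses deliberate misspellings and obfuscated variants. The actual notebook may use a different matching rule.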

Created:

2025-03-08
