Cross-mentions between 4chan and Reddit
Download link:
https://zenodo.org/record/10059084
Description:
The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references.
The timeframe starts at the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008).
The datasets specifically entail the following:
1. Cross-mentions from Reddit to 4chan
reddit-mentions-to-4chan.csv
I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) containing the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also applied a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors who had 100 or more contributions and whom I manually identified as automated accounts.
I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference.
This resulted in 2,638,621 "4chan" references across Reddit.
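The two filtering steps for dataset 1 can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the record layout (dicts with `author` and `body` keys), the function names, the `manual_bots` set, and the case-insensitive username match are all assumptions.

```python
import re

# Assumption: usernames are matched case-insensitively on "bot" or "auto".
BOT_NAME = re.compile(r"bot|auto", re.IGNORECASE)
# The two URL forms named in the description, plus whatever path follows them.
BOARD_URL = re.compile(r"://boards\.(?:4chan|4channel)\.org\S*", re.IGNORECASE)

def is_probable_bot(author, manual_bots=frozenset()):
    """Username heuristic plus a (hypothetical) manually curated bot list."""
    return bool(BOT_NAME.search(author)) or author in manual_bots

def is_url_only_mention(body):
    """True if "4chan" does not occur outside boards.4chan(nel).org URLs."""
    without_urls = BOARD_URL.sub(" ", body)
    return "4chan" not in without_urls.lower()

def filter_comments(comments, manual_bots=frozenset()):
    """Keep comments that pass both the bot filter and the URL-only filter."""
    return [
        c for c in comments
        if not is_probable_bot(c["author"], manual_bots)
        and not is_url_only_mention(c["body"])
    ]
```

Note that a username heuristic like this also drops human accounts whose names happen to contain "bot" or "auto", which is consistent with the description of the filter as rudimentary.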
2. Cross-mentions from 4chan/pol/ to Reddit
4chan-pol_mentions-of-reddit.csv
With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor").
I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference.
This resulted in 1,640,273 "Reddit" references on /pol/.
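As an illustration of the query and the URL-only filter described for /pol/, the following sketch strips the three listed URL forms and then checks whether a keyword still occurs. The function name and the choice to strip the rest of the URL path are assumptions; the actual query was run through 4CAT and may differ in detail.

```python
import re

# Substring match, so affixed forms such as "Redditor" also count.
KEYWORD = re.compile(r"reddit|plebbit", re.IGNORECASE)
# The three URL forms named in the description, plus any path that follows.
REDDIT_URL = re.compile(r"(?:://|www\.|i\.)reddit\.com/\S*", re.IGNORECASE)

def mentions_reddit(text):
    """True if a keyword occurs somewhere outside the listed URL forms."""
    without_urls = REDDIT_URL.sub(" ", text)
    return bool(KEYWORD.search(without_urls))
```

A post containing only a link such as `https://www.reddit.com/r/news/...` is rejected, while a post that links to Reddit and also names it in the text is kept.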
3. Cross-mentions from 4chan/b/ to Reddit
4chan-b_mentions-to-reddit.csv
I extracted five million posts from Jason Scott's 4chan/b/ dump. I then queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor").
I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference.
This resulted in 1,287 "Reddit" references on /b/.
See Hagen (2020) for more information on the 4chan/b/ dataset.
4. Cross-mention metrics
cross-mention-metrics.xlsx
I extracted the following metrics from the datasets above:
4.1 The total number of cross-mentions, absolute and relative, per month
This simply used the monthly counts from datasets 1 and 2.
4.2 The most mentioned subreddits on /pol/, per year
Using the regular expression: r\/[a-zA-Z_]
4.3 Subreddits that mention 4chan most often, per year
4.4 4chan boards mentioned across Reddit, per month
4.5 4chan boards mentioned by subreddits
I counted each subreddit or board at most once per post, rather than counting every occurrence.
For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:
(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)
I also omitted 4chan's /r/, /u/, and /s/ boards; despite their small scale, they appeared as false positives due to their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix).
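The board extraction for 4.4 and 4.5 might look like the sketch below, which applies the regular expression above verbatim and counts each board at most once per post. The function names and the case-insensitive flag are assumptions, not details confirmed by the description.

```python
import re
from collections import Counter

# The pattern from the description, copied verbatim.
# Assumption: matching is case-insensitive.
BOARD_PATTERN = re.compile(
    r"(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst"
    r"|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co"
    r"|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci"
    r"|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)",
    re.IGNORECASE,
)

def boards_in_post(text):
    """Return the set of boards named in one post (each at most once)."""
    return {m.group(2).lower() for m in BOARD_PATTERN.finditer(text)}

def count_board_mentions(posts):
    """Count, per board, the number of posts mentioning it."""
    counts = Counter()
    for text in posts:
        counts.update(boards_in_post(text))
    return counts
```

In this sketch `finditer` returns non-overlapping matches, and the pattern consumes the whitespace after a board name, so two board names separated by a single space yield only the first one.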
4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi).
Lastly, I also included:
4.6 The total number of posts on 4chan and Reddit
This was used to calculate 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 requires a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io.
Each of these metrics has its own corresponding tab in the Excel file.
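As a rough illustration of how the relative counts in 4.1 can be derived from the totals in 4.6, the sketch below divides monthly cross-mentions by monthly total posts. That "relative" means this simple share is an assumption, as are the function name and the `YYYY-MM` keys.

```python
def relative_mentions(monthly_mentions, monthly_totals):
    """Share of all posts in a month that contain a cross-mention.

    Both arguments map a month key (e.g. "2014-09") to a count.
    Months with a missing or zero total are skipped.
    """
    return {
        month: count / monthly_totals[month]
        for month, count in monthly_mentions.items()
        if monthly_totals.get(month)
    }
```

For example, 50 cross-mentions in a month with 1,000 posts in total gives a relative value of 0.05, i.e. 5% of posts (the figures here are made up for illustration).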
5. Co-words of "4chan" and "reddit" in the cross-mentions
co-words.xslx
Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year.
I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted "back"), and lemmatisation.
For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words or commonly used hate speech terms prevalent throughout 4chan.
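A minimal sketch of the co-word step, assuming the posts are already pre-processed into token lists: it reads "directly next to" as the tokens immediately before and after the target word, which is only one possible interpretation of the window size. The function name, the `span` parameter, and the `exclude` set are hypothetical.

```python
from collections import Counter

def co_words(token_lists, target, span=1, exclude=frozenset(), top_n=10):
    """Return the top_n tokens found within `span` positions of `target`.

    Assumption: span=1 (the immediate neighbours) corresponds to the
    window described in the text.
    """
    counts = Counter()
    for tokens in token_lists:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(
                w for w in left + right
                if w != target and w not in exclude
            )
    return counts.most_common(top_n)
```

Running this once per year on the tokenised posts of that year would give a per-year top ten; the `exclude` set stands in for the list of semantically uninteresting or hateful terms that were filtered out.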
6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014
annotations_4chanpol-2014.csv
annotations_reddit-2014-kotakuinaction-anonimised.csv
annotations_reddit-2014-tumblrinaction-anonimised.csv
I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close-reading and annotation.
__
The author names are removed for all datasets.
Created: 2023-10-31



