
Cross-mentions between 4chan and Reddit

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10059084
Description:
The included datasets form the basis of the article "No space for Reddit spacing: Mapping the reflexive relationship between groups on 4chan and Reddit", published in Social Media + Society. They include cross-mentions between 4chan and Reddit, as well as various metrics associated with these cross-references. The timeframe starts at the earliest date available (for Reddit: June 2006; for 4chan/b/: April 2006; for 4chan/pol/: December 2013) and ends in January 2023 (except for the 4chan/b/ dataset, which ends in December 2008). The datasets specifically entail the following:

1. Cross-mentions from Reddit to 4chan
reddit-mentions-to-4chan.csv

I used the Pushshift API's search endpoint to fetch Reddit comments (so no opening posts) with the keyword "4chan" (note: this Pushshift functionality is now deprecated). I also used a rudimentary filter to remove posts by bots, specifically by 1) deleting posts from every account that had "bot" or "auto" in the username and 2) removing all posts by authors with 100 or more contributions whom I manually identified as automated accounts. I removed URL-only cross-references, i.e. posts that only mentioned "://boards.4chan.org" or "://boards.4channel.org" without another 4chan reference. This resulted in 2,638,621 "4chan" references across Reddit.

2. Cross-mentions from 4chan/pol/ to Reddit
4chan-pol_mentions-of-reddit.csv

With a complete dataset of /pol/ collected through 4CAT, I queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,640,273 "Reddit" references on /pol/.

3. Cross-mentions from 4chan/b/ to Reddit
4chan-b_mentions-to-reddit.csv

I extracted five million posts from Jason Scott's 4chan/b/ dump. I then queried for "reddit" or the common synonym "plebbit", case-insensitive, with prefixes and suffixes allowed (e.g. "Redditor"). I removed URL-only cross-references, i.e. posts that only mentioned "://reddit.com/", "www.reddit.com/", or "i.reddit.com/" without another Reddit reference. This resulted in 1,287 "Reddit" references on /b/. See Hagen (2020) for more information on the 4chan/b/ dataset.

4. Cross-mention metrics
cross-mention-metrics.xlsx

I extracted the following metrics from the datasets above:

4.1 The total number of cross-mentions, absolute and relative, per month
This simply used the monthly counts from datasets 1 and 2.

4.2 The most mentioned subreddits on /pol/, per year
Using the regular expression: r\/[a-zA-Z_]

4.3 Subreddits that mention 4chan most often, per year

4.4 4chan boards mentioned across Reddit, per month

4.5 4chan boards mentioned by subreddits

I counted each subreddit or board mention once per post rather than counting total occurrences. For 4.4 and 4.5, I used the following regular expression to extract 4chan board names:

(\s|^|4chan)\/(a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new)\/(\s|$)

I also omitted 4chan's /r/, /u/, and /s/ boards; although these boards are small, they produced false positives because of their unrelated vernacular meaning on Reddit (e.g. /u/ as a username prefix). 4.5 was also transformed and included as a Gephi network file (subreddit-board-mentions.gephi).

Lastly, I also included:

4.6 The total number of posts on 4chan and Reddit
This was used to calculate the relative counts in 4.1. It uses Pushshift's database statistics (which as of Nov. 2023 require a login; see this Pastebin for an alternative) and metrics of total 4chan post counts from 4stats.io.

Each of these metrics has its own corresponding tab in the Excel file.

5. Co-words of "4chan" and "reddit" in the cross-mentions
co-words.xslx

Using datasets 1, 2, and 3, I extracted the top ten words appearing directly next to "4chan" on Reddit, and next to "Reddit" on 4chan, per year. I first pre-processed the text, which involved tokenisation, filtering of unwanted text elements like URLs, stop word removal (I whitelisted "back"), and lemmatisation. For the co-word extraction I used a window size of two. I excluded a range of semantically uninteresting words and commonly used hate speech terms prevalent throughout 4chan.

6. Annotated cross-mentions between Reddit and 4chan/pol/ in September 2014
annotations_4chanpol-2014.csv
annotations_reddit-2014-kotakuinaction-anonimised.csv
annotations_reddit-2014-tumblrinaction-anonimised.csv

I extracted cross-mentions from /pol/ to Reddit and from Reddit to 4chan in September 2014 for close-reading and annotation.

The author names are removed for all datasets.
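
The keyword filtering described for datasets 2 and 3 and the board extraction described for 4.4 and 4.5 could be approximated roughly as follows. This is a minimal Python sketch under stated assumptions, not the code used for the article: all function and variable names are hypothetical, stripping the full path of a matched Reddit URL is an assumption, and lookarounds are swapped in for the consuming (\s|^|4chan) and (\s|$) groups of the quoted regular expression so that adjacent mentions such as "/b/ /pol/" are both found.

```python
import re
from collections import Counter

# Board names copied from the regular expression quoted in section 4.
BOARDS = (
    "a|b|c|d|e|f|g|gif|h|hr|k|m|o|p|t|v|vg|vm|vmg|vr|vrpg|vst|w|wg|i|ic|"
    "r9k|s4s|vip|qa|cm|hm|lgbt|y|3|aco|adv|an|bant|biz|cgl|ck|co|diy|fa|"
    "fit|gd|hc|his|int|jp|lit|mlp|mu|n|news|out|po|pol|pw|qst|sci|soc|sp|"
    "tg|toy|trv|tv|vp|vt|wsg|wsr|x|xs|new"
)

# Assumption: non-consuming lookarounds instead of the original groups,
# so a space shared by two neighbouring mentions can serve both.
BOARD_RE = re.compile(r"(?:^|(?<=\s)|(?<=4chan))/(" + BOARDS + r")/(?=\s|$)")

# Assumption: "reddit" or "plebbit" anywhere in a word, case-insensitive,
# which allows prefixes and suffixes such as "Redditor".
KEYWORD_RE = re.compile(r"reddit|plebbit", re.IGNORECASE)

# Assumption: the three URL forms named in the description, together with
# whatever path follows them, are treated as URL-only material.
URL_RE = re.compile(r"(?:://|www\.|i\.)reddit\.com/\S*", re.IGNORECASE)


def is_reddit_mention(text):
    """Return True if a post refers to Reddit beyond a bare reddit.com URL."""
    without_urls = URL_RE.sub(" ", text)
    return bool(KEYWORD_RE.search(without_urls))


def boards_in_post(text):
    """Return the set of boards named in one post (each counted once)."""
    return set(BOARD_RE.findall(text))


def count_board_mentions(posts):
    """Count, per board, the number of posts that mention it at least once."""
    counts = Counter()
    for text in posts:
        counts.update(boards_in_post(text))
    return counts


if __name__ == "__main__":
    sample = [
        "4chan/pol/ is at it again, /pol/ never changes",
        "go back to /b/ /pol/",
        "ask /u/someone on r/AskReddit",
    ]
    print(count_board_mentions(sample))
    print(is_reddit_mention("look https://www.reddit.com/r/pics/"))
```

Collecting boards into a set before updating the counter is what makes a board count once per post rather than once per occurrence, matching the per-post counting described above.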
Created:
2023-10-31