
187ro/incelset

Source: Hugging Face. Updated 2023-10-30. Indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/187ro/incelset
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- fill-mask
tags:
- not-for-all-audiences
pretty_name: Incel Dataset 🎭
size_categories:
- 100K<n<1M
language:
- en
---

# Dataset Card for IncelSet

### Dataset Summary

This dataset is based on the incels.is forum and is ⚠️HIGHLY OFFENSIVE⚠️

It is a compilation of almost 3 years' worth of posts, covering topics such as (self-described) celibacy, self-views, life improvement (attempts or advice), suicide, perceived failure, views on women, views on society, and views on politics, all from the members' perspective.

Co-authored by inmate & curly for Universiteit van Amsterdam [Politics, Psychology, Law and Economics (PPLE)](https://pple.uva.nl)

### Languages

English with a lot of racial slurs, misogyny, mentions of sexual assault and general hatred. Do not view or use if easily offended.

## Dataset Structure

The dataset consists of 2 columns: "title", representing the thread title, and "text", representing the user replies (posts) under that thread title.

### Source Data

Incels.is forum.

#### Initial Data Collection and Normalization

1. We first built a script in Go that scrapes all the content of the incels.is forum. We downloaded roughly 150,000 threads, containing almost 2.1 million posts, in approximately 9 hours from start to finish, using a dedicated server with 72 cores.
2. We then processed the scraped data, first building a Python script that formatted it as JSON according to RFC 8259.
3. We then removed PII (Personally Identifiable Information), thus anonymizing user posts in the dataset. This was not hard to do, as users already use monikers and never gave out personal information such as full names, addresses or social security numbers. Nevertheless, we still validated the removal of such data.
4. We then removed leftover non-human-readable text such as HTML tags or base64 encodings, along with URLs users may have posted in their discussions.
5. We then compiled all 143,501 remaining files (threads) and ~2.1M posts into Parquet.
6. The final result is approximately 1 billion characters in ~144k rows.

#### Who are the source language producers?

Self-described incels / members of the incels.is website (not to be taken in the literal sense of the word).

### Personal and Sensitive Information

Includes details of the users' (tragic and tragically self-perceived) lives. The dataset contains no personal information as such but touches on many sensitive subjects.

## Considerations for Using the Data

Go wild with it. Keep in mind that we are not trying to expose, radicalize or even remotely harm this community. We have compiled almost 3 years' worth of posts on this forum so we could better study this phenomenon for a university project. We will consider publishing the model trained on this data, but we do not currently see a scientific gain that would convince us to do so.

### Social Impact of Dataset

**Public Awareness and Education**

- Pro: Publishing a dataset might bring greater public awareness to the issue and could be used for educational purposes, enlightening people about the intricacies of this community. Greater understanding might foster empathy and encourage supportive interventions.
- Con: It might also inadvertently glamorize or sensationalize the community, leading to increased interest in, and potential growth of, such ideologies.
- Source: Marwick, A., & Caplan, R. (2018). Drinking male tears: Language, the manosphere, and networked harassment. Feminist Media Studies, 18(4), 543-559.

**Potential Stigmatization and Alienation**

- Pro: Identifying problematic behaviors and attitudes can help professionals develop targeted interventions.
- Con: Generalizing or pathologizing the behaviors of this community might further stigmatize and alienate its members. Labeling can reinforce undesirable behavior if individuals internalize these negative identities.
- Source: Dovidio, J. F., Major, B., & Crocker, J. (2000). Stigma: Introduction and overview. In T. F. Heatherton, R. E. Kleck, M. R. Hebl, & J. G. Hull (Eds.), The social psychology of stigma (pp. 1-28).

**Misuse of Data**

- Pro: When used responsibly, such a dataset can be a treasure trove for academic research.
- Con: There is always a risk of data being misused, misinterpreted, or cherry-picked to support harmful narratives or agendas.
- Source: boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662-679.

**Ethical Concerns**

- Pro: Revealing problematic beliefs might serve a greater good.
- Con: There are ethical concerns, especially if data was collected without consent. Respect for individuals' autonomy and privacy is paramount in research ethics. (The data was collected anonymously from a free-to-view forum that requires no sign-up and does not block scraping, as per its ToS.)
- Source: National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont report: Ethical principles and guidelines for the protection of human subjects of research.

**Psychological Impact on Incels**

- Pro: Confronting one's views might lead to self-reflection and change.
- Con: Conversely, it might entrench their beliefs further if they feel attacked or misunderstood, a phenomenon described as the backfire effect.
- Source: Nyhan, B., & Reifler, J. (2010). When corrections fail: The persistence of political misperceptions. Political Behavior, 32(2), 303-330.

### Discussion of Biases

The authors compiled only the first 150,000 of the 270,000 threads in the "Inceldom discussion" part of the forum. As a consequence, older posts have been left out and the dataset may not represent the full extent of incel discourse. The authors declare no further biases or conflicts of interest: the data was scraped and processed as it appears on the forum.
Provider:
187ro
Summary of the original information

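Dataset Card for IncelSet

Dataset overview

This dataset is based on the incels.is forum and contains almost 3 years of posts. The content is highly offensive and covers topics such as (self-described) celibacy, self-views, life improvement (attempts or advice), suicide, perceived failure, views on women, views on society, and views on politics, all from the members' perspective.

Co-authors: inmate & curly, Universiteit van Amsterdam, Politics, Psychology, Law and Economics (PPLE)

Languages

English, with a large amount of racist content, misogyny, mentions of sexual assault, and general hatred. Do not view or use if easily offended.

Dataset structure

The dataset contains two columns: "title" (the thread title) and "text" (the user replies under that thread title).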
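The two-column layout can be illustrated with a minimal Python sketch. It assumes one row per thread, with all replies joined into the single "text" field. That reading is suggested by the card's row count (~144k rows for 143,501 threads) but is not confirmed by the source. The function name `thread_to_row` and the newline separator are hypothetical choices for illustration only.

```python
from typing import Dict, List


def thread_to_row(title: str, posts: List[str], sep: str = "\n") -> Dict[str, str]:
    """Build one {"title", "text"} row from a thread.

    Assumption: replies are concatenated with `sep`; the card does not
    say how the authors actually joined posts inside "text".
    """
    # Drop replies that are empty after trimming whitespace.
    kept = [p.strip() for p in posts if p.strip()]
    return {"title": title.strip(), "text": sep.join(kept)}


row = thread_to_row(" Example thread ", ["first reply", "  ", "second reply "])
print(sorted(row))  # the only two columns: ['text', 'title']
```

Keeping the row a plain dictionary with exactly these two string fields makes it straightforward to serialize to JSON or to load into a columnar format such as Parquet later.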

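Source data

Incels.is forum.

Initial data collection and normalization

  1. First, a script written in Go scraped all the content of the incels.is forum.
  2. Roughly 150,000 threads were downloaded, containing almost 2.1 million replies, in approximately 9 hours on a dedicated 72-core server.
  3. A Python script then processed the data and formatted it as JSON conforming to RFC 8259.
  4. Personally identifiable information (PII) was removed, thus anonymizing user posts.
  5. Leftover non-human-readable text, such as HTML tags or base64 encodings, was removed, along with URLs users may have posted in their discussions.
  6. All 143,501 remaining files (threads) and about 2.1 million replies were compiled into Parquet format.
  7. The final result is approximately 1 billion characters across about 144,000 rows.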
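The cleanup and JSON-formatting steps above can be sketched in Python as follows. This is not the authors' script, which is not part of the card. The regular expressions are illustrative heuristics, and the names `clean_post` and `to_json_line` are hypothetical. In particular, the base64 pattern simply treats any unbroken run of 40 or more base64-alphabet characters as encoded data, which is an assumption and may occasionally match ordinary text.

```python
import html
import json
import re

# Illustrative heuristics only; the authors' actual rules are not documented.
TAG_RE = re.compile(r"<[^>]+>")                        # HTML tags
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.I)   # bare URLs
B64_RE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")       # long base64-like runs
WS_RE = re.compile(r"\s+")                             # whitespace runs


def clean_post(raw: str) -> str:
    """Strip tags, URLs and base64-like runs, then normalize whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)   # turn entities such as &amp; back into characters
    text = URL_RE.sub(" ", text)
    text = B64_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def to_json_line(title: str, posts: list) -> str:
    """Serialize one cleaned thread as a JSON object on a single line."""
    row = {
        "title": clean_post(title),
        "text": "\n".join(clean_post(p) for p in posts),
    }
    # json.dumps emits RFC 8259-compliant JSON for plain str/dict data.
    return json.dumps(row, ensure_ascii=False)


line = to_json_line(
    "Some <i>title</i>",
    ["<p>Hello &amp; bye</p>", "see www.example.com/page ok"],
)
print(line)
```

Tags are stripped before entities are unescaped so that text a user typed as literal `&lt;...&gt;` is preserved as characters and not mistaken for markup.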

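Source language producers

Self-described incels / members of the incels.is website (not in the literal sense of the word).

Personal and sensitive information

Includes details of the users' (tragic and self-perceived as tragic) lives. Contains no personal information as such, but touches on many sensitive subjects.

Considerations for using the data

Free to use. Keep in mind that we are not trying to expose, radicalize, or in any way harm this community. We compiled almost 3 years of forum posts to better study this phenomenon for a university project. We will consider publishing a model trained on this data, but we do not see a potential scientific gain that would convince us to do so.

Social impact of the dataset

Public awareness and education

Pro: Publishing the dataset might raise public awareness of the issue and could be used for educational purposes, helping people understand the intricacies of this community. Greater understanding might foster empathy and encourage supportive interventions.
Con: It might also inadvertently glamorize or sensationalize the community, leading to increased interest in, and potential growth of, such ideologies.

Potential stigmatization and alienation

Pro: Identifying problematic behaviors and attitudes can help professionals develop targeted interventions.
Con: Generalizing or pathologizing the behaviors of this community might further stigmatize and alienate its members. Labeling can reinforce undesirable behavior if individuals internalize these negative identities.

Misuse of data

Pro: When used responsibly, such a dataset can be a treasure trove for academic research.
Con: There is always a risk of the data being misused, misinterpreted, or cherry-picked to support harmful narratives or agendas.

Ethical concerns

Pro: Revealing problematic beliefs might serve a greater good.
Con: There are ethical concerns, especially if data was collected without consent. Respect for individuals' autonomy and privacy is paramount in research ethics.

Psychological impact on incels

Pro: Confronting one's own views might lead to self-reflection and change.
Con: Conversely, it might entrench their beliefs further if they feel attacked or misunderstood, a phenomenon described as the backfire effect.

Discussion of biases

The authors compiled only the first 150,000 threads in the "Inceldom discussion" part of the forum, so older posts were left out and the dataset may not represent the full extent of incel discourse. The authors declare no further biases or conflicts of interest: the data was scraped and processed as it appears on the forum.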
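The coverage gap behind this bias can be made concrete with a short calculation. The sketch below uses only figures stated in the dataset card (270,000 threads in the forum section, 150,000 scraped, 143,501 kept after cleaning). The variable names are chosen here for illustration.

```python
# Figures taken from the dataset card.
TOTAL_THREADS = 270_000    # threads in the "Inceldom discussion" section
SCRAPED_THREADS = 150_000  # threads the authors downloaded
KEPT_THREADS = 143_501     # threads remaining after cleaning

coverage = SCRAPED_THREADS / TOTAL_THREADS  # share of the section scraped
retained = KEPT_THREADS / SCRAPED_THREADS   # share of scraped threads kept
overall = KEPT_THREADS / TOTAL_THREADS      # share of the section in the dataset

print(f"scraped:  {coverage:.1%}")  # 55.6%
print(f"retained: {retained:.1%}")  # 95.7%
print(f"overall:  {overall:.1%}")   # 53.1%
```

Roughly half of the section's threads are therefore represented, and the missing half is concentrated in the older posts.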

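Collected summary
Dataset introduction
Background and challenges
Background overview
This dataset, named incelset, is mainly intended for text-generation and fill-mask tasks. It contains English text data in JSON format, with between 100,000 and 1,000,000 entries. The dataset is tagged Not-For-All-Audiences and may contain sensitive or harmful content, so it should be used with caution.
The content above was collected and summarized by 遇见数据集.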