steamcyclone/Pill-Ideologies-New-Test
Source: Hugging Face. Updated 2024-02-08; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/steamcyclone/Pill-Ideologies-New-Test
Resource description:
---
license: cc
language:
- en
tags:
- natural-language-understanding
- ideology classification
- text classification
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
pretty_name: PiLls
size_categories:
- n<10K
source_datasets:
- reddit
task_categories:
- text-classification
task_ids:
- multi-class-classification
---
---
# Dataset Card for PiLls
<!-- Provide a quick summary of the dataset. -->
This dataset is a tool for tracing linguistic patterns in Reddit posts by members of the internet-centric pill ideologies known as the black pill, red pill, and blue pill.
## Dataset Details
### Dataset Description
Posts from several of the major groups, spanning different years, have been combined into one dataset. There are more than 200 posts for each of the major pill groups on Reddit (red pill rebooted, blue pill, black pill, married red pill, red pill women, and feminism as a counterpoint of reference). The feminism group was added as a juxtaposition to red pill women, in order to allow researchers to explore those dichotomies. For researchers, the value lies in identifying or classifying the kinds of words that make one ideology stand out from another.
- **Curated by:** [steamcyclone]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [steamcyclone]
- **Language(s) (NLP):** [EN]
- **License:** [CC]
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [This is the only source]
## Uses
The main use of this dataset is to study linguistic patterns. Running models to detect word usage per group, as well as overlaps across groups, is an ideal use. With the rise of the loneliness epidemic, any insights that come from this work are welcome.
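As a rough illustration of per-group word usage and cross-group overlap, the sketch below counts tokens per subreddit and intersects the most frequent ones. It assumes records are plain dicts with `subreddit`, `title`, and `text` keys; the sample records, the tokenizer regex, and the helper names `word_counts_by_group` and `shared_top_words` are hypothetical and not part of the dataset.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z']+")  # assumption: a crude lowercase word tokenizer


def word_counts_by_group(records):
    """Return {subreddit: Counter of tokens} over each post's title and text."""
    counts = defaultdict(Counter)
    for rec in records:
        # `text` may be missing or None, so fall back to an empty string.
        body = f"{rec.get('title') or ''} {rec.get('text') or ''}".lower()
        counts[rec["subreddit"]].update(TOKEN_RE.findall(body))
    return dict(counts)


def shared_top_words(counts, group_a, group_b, k=50):
    """Return the words that appear in the top-k lists of both groups."""
    top_a = {word for word, _ in counts[group_a].most_common(k)}
    top_b = {word for word, _ in counts[group_b].most_common(k)}
    return top_a & top_b


# Hypothetical sample records, for illustration only.
sample = [
    {"subreddit": "group_a", "title": "Dating advice", "text": "dating is hard"},
    {"subreddit": "group_b", "title": "On dating", "text": None},
]
counts = word_counts_by_group(sample)
overlap = shared_top_words(counts, "group_a", "group_b")
```

In practice a stop-word filter would probably be needed before comparing top-k lists, since very common function words tend to dominate raw counts.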
### Direct Use
Suitable use cases include multi-class classification, word or semantic clustering per group, summarization modeling, text parsing, and other natural language processing tasks.
### Out-of-Scope Use
This dataset is not meant to be used to demonize or mock online communities for the trials in life in which individuals find themselves. If your aim is to push a misandrist or misogynistic agenda, please do not use this dataset.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
Currently, each record contains the following fields:
- subreddit: subreddit of the post (string)
- postid: ID of the post (string)
- title: title of the post (string)
- text: text of the post, where applicable (string)
- url: link, if something was embedded (string)
- score: score of the post (int32)
- author: author of the post (string)
- date: date of the post (int64)
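The field list above can be expressed as a small schema check. This is a minimal sketch, assuming each record is a plain Python dict keyed by the field names in this card; `EXPECTED_FIELDS`, `OPTIONAL_FIELDS`, `validate_record`, and the sample record are hypothetical names and values introduced for illustration.

```python
# Field names and types as listed in this card; int32/int64 map to Python int.
EXPECTED_FIELDS = {
    "subreddit": str,
    "postid": str,
    "title": str,
    "text": str,
    "url": str,
    "score": int,
    "author": str,
    "date": int,
}
# Assumption: `text` and `url` may be empty or missing, since the card marks
# them "where applicable" and "if something was embedded".
OPTIONAL_FIELDS = {"text", "url"}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        value = record.get(name)
        if value is None:
            if name not in OPTIONAL_FIELDS:
                problems.append(f"missing field: {name}")
            continue
        # bool is a subclass of int, so reject bools explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


# Hypothetical record, for illustration only.
example = {
    "subreddit": "example_sub",
    "postid": "abc123",
    "title": "Example title",
    "text": None,
    "url": None,
    "score": 12,
    "author": "example_user",
    "date": 1700000000,
}
problems = validate_record(example)
```

Returning a list of problems rather than raising on the first one makes it easier to survey how many records in a scrape need cleaning.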
## Dataset Creation
### Curation Rationale
With the rise of the loneliness epidemic and the radicalization of internet content that pits men and women against each other, it is important to understand the root of the problem. Depending on whom you ask, you will get a plethora of answers. Jordan Peterson describes it as a post-modernist, feminist, liberal problem. The Andrew Tates and other conservative archetypes blame the loss of traditionalism. Others blame dating apps and their selection-bias effects. Within each of the major pill ideologies, with the exception of the black pill, men blame women, and women blame men.
Unfortunately, as substantiated by research and media coverage, male spaces have in recent years only been able to exist on the internet, and counter-spaces have emerged to challenge the views held in the differing ideologies.
In short, according to archetypal definitions:
- the red pill is the emancipation of masculinity in a feminized age and the understanding of mating strategies with women.
- the blue pill is the satire of the red pill, often run by women.
- the black pill is meant to bridge the gaps across the red, pink, and blue pills in order to land on a ground truth.
- the pink pill is about improving the female image by augmenting sexual marketplace value.
### Source Data
Each record is a Reddit post, with approximately 200 posts per group. A record has a title and, where applicable, body text that conveys the author's intended message.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
In progress.
However, the plan is to increase the number of records and leverage the ChatGPT API to summarize the messages into categories. In addition, the dates need some cleaning to make them more useful to researchers. I am also not sure whether I can retrieve the comments for each post, which would further augment the data.
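The date cleanup mentioned above might look like the following. This is only a sketch under a strong assumption: the card says `date` is an int64 but does not state its unit, so the code assumes Unix epoch time in seconds (UTC) and treats very large values as milliseconds. The helper name `to_utc_datetime` and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(raw_date):
    """Convert an integer timestamp to a timezone-aware UTC datetime.

    Assumption: `raw_date` is Unix epoch time in seconds. Values of 10**12
    or more are treated as milliseconds and scaled down first.
    """
    seconds = raw_date / 1000 if abs(raw_date) >= 10**12 else raw_date
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical value, for illustration only.
posted_at = to_utc_datetime(1700000000)
iso_string = posted_at.isoformat()
```

If the stored integers turn out to use a different unit or encoding, only the conversion inside `to_utc_datetime` would need to change.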
#### Who are the source data producers?
The producers of the data are the various Redditors who have participated in these spaces.
### Annotations [optional]
The ChatGPT summarizations, planned for the future, will be an annotation that is not part of the original collection. The subreddit labels merely record the origins of the posts.
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
The subreddit of origin serves as the label for each record.
#### Who are the annotators?
The subreddit of origin and I are the label annotators.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
This dataset contains no personally identifiable information, with the exception of embedded YouTube links. Those links may lead to videos whose content and impact are unknown.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
A major caveat is that the pink pill and original red pill groups are shadow-banned, which impedes scraping them. I recognize this as a flaw, because the original red pill movement, which started in books, propagated itself through its internet (Reddit) variant and spawned all the other pills.
Another source of bias is that there is more red pill content, as a means of compensating for the ban of the original red pill subreddit.
As such, I caution researchers to balance their datasets where necessary.
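One simple way to balance groups is to downsample every subreddit to the size of the smallest one. The sketch below is one option among several (class weighting or oversampling are alternatives), not the curator's method; it assumes records are plain dicts with a `subreddit` key, and `balance_by_subreddit` and the sample records are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_subreddit(records, seed=0):
    """Downsample every subreddit to the size of the smallest one."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["subreddit"]].append(rec)
    if not groups:
        return []
    target = min(len(posts) for posts in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for name in sorted(groups):
        # Sample without replacement so no post is duplicated.
        balanced.extend(rng.sample(groups[name], target))
    return balanced


# Hypothetical, deliberately imbalanced records for illustration only.
sample = (
    [{"subreddit": "group_a", "postid": f"a{i}"} for i in range(5)]
    + [{"subreddit": "group_b", "postid": f"b{i}"} for i in range(2)]
)
balanced = balance_by_subreddit(sample)
```

Downsampling discards data, which matters for a dataset this small, so it is worth comparing against weighting the classes during training instead.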
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. Remember that this dataset is not a tool for reckless and hateful political agendas.
## Citation [optional]
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
Pill ideologies:
In short, according to archetypal definitions:
- the red pill is the emancipation of masculinity in a feminized age and the understanding of mating strategies with women.
- the blue pill is the satire of the red pill, often run by women.
- the black pill is meant to bridge the gaps across the red, pink, and blue pills in order to land on a ground truth.
- the pink pill is about improving the female image by augmenting sexual marketplace value.
## Dataset Card Authors [optional]
steamcyclone, and all the Redditors from the subreddits listed in the author column.
## Dataset Card Contact
- N/A
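Provided by:
steamcyclone
Summary of Original Information
Dataset Card: PiLls
Dataset Overview
This dataset aims to help trace differences in the linguistic patterns of Reddit posts by users who take part in internet-centric ideologies (such as the black pill, red pill, and blue pill). It mainly contains posts, from different years, from several major pill groups (red pill rebooted, blue pill, black pill, married red pill, red pill women, and feminism as a contrasting reference point). The feminism group was added as a contrast to red pill women so that researchers can explore these oppositions.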
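Dataset Details
Dataset Description
- Language(s) (NLP): English
- License: CC
- Dataset size: n<10K
- Source dataset: Reddit
- Task category: text classification
- Task ID: multi-class classification
Dataset Structure
The dataset contains the following fields:
- subreddit: the subreddit the post belongs to, string
- postid: the post ID, string
- title: the post title, string
- text: the post content (where applicable), string
- url: link to embedded content (if any), string
- score: the post score, integer
- author: the post author, string
- date: the post date, integer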
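Dataset Creation
Data Collection and Processing
The plan is to increase the number of records and use the ChatGPT API to summarize the messages into categories. The dates need some cleaning to make them more useful to researchers. It is not yet certain whether the comments for each post can be retrieved to further augment the data.
Data Producers
The producers of the data are the various Reddit users who participate in these Reddit spaces.
Annotations
Annotation Process
The origin of each post is the label of the record.
Annotators
The subreddit of origin and I are the label annotators.
Personal and Sensitive Information
The dataset contains no personally identifiable information, except for embedded YouTube links. These links may point to content whose impact is unknown.
Bias, Risks, and Limitations
The main caveat is that the pink pill and original red pill groups are banned, which impeded the scraping process. I consider this a flaw, because the original red pill movement spread through its internet (Reddit) variant and gave rise to all the other pills. Another source of bias is that there is more red pill content, as compensation for the ban of the original red pill subreddit. I therefore advise researchers to balance their datasets where necessary.
Recommendations
Users should be aware of the risks, biases, and technical limitations of the dataset. Remember that this dataset is not a tool for reckless and hateful political agendas.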



