SamW/HumanMOD
Source: Hugging Face. Updated: 2023-07-08. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/SamW/HumanMOD
Resource description:
---
language:
- en
task_categories:
- text-classification
pretty_name: HumanMOD (AMCIS 2023)
size_categories:
- 10M<n<100M
---
We are excited to share the release of the HumanMOD dataset, unveiled in our [AMCIS 2023 paper](https://aisel.aisnet.org/amcis2023/sig_aiaa/sig_aiaa/3):
Wang, Kanlun; Fu, Zhe; Zhou, Lina; and Zhang, Dongsong, "How Does User Engagement Support Content Moderation? A Deep Learning-based Comparative Study" (2023). AMCIS 2023 Proceedings. 3.
https://aisel.aisnet.org/amcis2023/sig_aiaa/sig_aiaa/3
Dataset Summary:
- The data collection was limited to public online communities to comply with the platform's privacy policy.
- We used the [Pushshift Reddit API](https://reddit-api.readthedocs.io/en/latest/) to scrape posts daily from 40 subreddits across four domains from August 24 to October 28, 2022, resulting in 104,674 posts.
- To enhance the ecological validity of the study findings, we used the [PRAW API](https://praw.readthedocs.io/en/stable/) to re-collect the same posts 2 months later and check whether each post had been moderated.
- Thereafter, we used a snowballing approach to collect the corresponding comments on all the posts.
- The metadata includes post content, post time, comment content, comment time, karma score, etc.
- We kept only posts with at least 2 comments and at most 15 direct comments, to facilitate the extraction of graph-based structural information.
- The final dataset is balanced: 8,511 moderated posts and 8,511 posts that were not moderated.
- All the posts were commented on, with a total of 148,344 comments.
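The comment-count thresholds described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `filter_posts`, the tuple layout, and the toy records are hypothetical, and it assumes that a "direct" comment is one whose parent ID equals the post ID, which the dataset card does not confirm.

```python
from collections import defaultdict


def filter_posts(comments, min_comments=2, max_direct=15):
    """Return the set of post IDs that satisfy both comment thresholds.

    `comments` is an iterable of (post_id, parent_id, comment_id) tuples.
    Assumption: a comment is "direct" when its parent_id equals the post_id.
    """
    total = defaultdict(int)   # all comments per post
    direct = defaultdict(int)  # top-level comments per post
    for post_id, parent_id, _comment_id in comments:
        total[post_id] += 1
        if parent_id == post_id:
            direct[post_id] += 1
    return {
        post_id
        for post_id in total
        if total[post_id] >= min_comments and direct[post_id] <= max_direct
    }


# Hypothetical toy records: p1 has two comments, p2 has only one.
toy = [
    ("p1", "p1", "c1"),  # direct comment on p1
    ("p1", "c1", "c2"),  # reply to c1
    ("p2", "p2", "c3"),  # the only comment on p2
]
kept = filter_posts(toy)
print(kept)
```

In this toy input only `p1` passes, because `p2` falls below the minimum of 2 comments.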
Data Fields for HumanMOD_Posts dataset:
- Reddit_ID: the unique identifiers for Reddit posts, which serve as the foreign keys bridging to the HumanMOD_Comments dataset.
- Subreddits: the names of subreddits
- Titles: the titles of Reddit posts
- Body: post content, which is an extended description of a post
- Author: the authors of posts
- URLs: the web addresses of Reddit posts
- Labels: (0) -- The post is not moderated; (1) -- The post is moderated by moderators.
Data Fields for HumanMOD_Comments dataset:
- Parent ID: the unique identifiers for parent comments
- Comment ID: the unique identifiers for child comments, the ones that reply to the parent comments
- Comment Body: the content of child comments
- Score: the karma scores of child comments
- Author: the authors of child comments
- Post ID: the specific Reddit post identifiers to which the child comment should be associated, which serve as the foreign keys bridging to the HumanMOD_Posts dataset.
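As a rough illustration of how the two tables relate, the sketch below links comments to posts through `Post ID` and `Reddit_ID` and groups replies under their `Parent ID` to form one reply graph per post. The records are invented for the example, the helper `build_reply_graphs` is hypothetical, and the assumption that a top-level comment carries the post's ID as its `Parent ID` should be checked against the actual files.

```python
from collections import defaultdict

# Hypothetical rows using the field names listed above.
posts = [
    {"Reddit_ID": "p1", "Labels": 1},
    {"Reddit_ID": "p2", "Labels": 0},
]
comments = [
    {"Post ID": "p1", "Parent ID": "p1", "Comment ID": "c1", "Score": 3},
    {"Post ID": "p1", "Parent ID": "c1", "Comment ID": "c2", "Score": 1},
    {"Post ID": "p2", "Parent ID": "p2", "Comment ID": "c3", "Score": 5},
]


def build_reply_graphs(posts, comments):
    """Map each post ID to an adjacency list {parent ID: [child comment IDs]}."""
    known = {post["Reddit_ID"] for post in posts}
    graphs = {post_id: defaultdict(list) for post_id in known}
    for comment in comments:
        post_id = comment["Post ID"]
        if post_id in known:  # skip comments whose post is not in the posts table
            graphs[post_id][comment["Parent ID"]].append(comment["Comment ID"])
    # Convert to plain dicts so missing keys are not created on lookup.
    return {post_id: dict(graph) for post_id, graph in graphs.items()}


graphs = build_reply_graphs(posts, comments)
print(graphs["p1"])
```

An adjacency list keeps the thread structure explicit, so depth or branching features can be computed per post without re-scanning the comments table.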
Provider:
SamW
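Summary of original information
Dataset overview
Basic information
- Name: HumanMOD (AMCIS 2023)
- Language: English
- Task category: text classification
- Dataset size: 10M<n<100M
Data collection
- Collection period: August 24 to October 28, 2022
- Data source: posts scraped daily through the Pushshift Reddit API from 40 subreddits in different domains, 104,674 posts in total.
- Validation method: the posts were collected again two months later through the PRAW API to check whether their content had been moderated.
- Comment collection: a snowballing approach was used to collect the corresponding comments on all posts.
Dataset details
- Post data: 8,511 moderated posts and 8,511 posts that were not moderated.
- Comment data: 148,344 comments in total.
- Data fields:
  - HumanMOD_Posts:
    - Reddit_ID
    - Subreddits
    - Titles
    - Body
    - Author
    - URLs
    - Labels (0: not moderated, 1: moderated)
  - HumanMOD_Comments:
    - Parent ID
    - Comment ID
    - Comment Body
    - Score
    - Author
    - Post ID
Data processing
- Comment thresholds: a minimum of 2 comments and an upper bound of 15 direct comments per post, to support the extraction of graph-based structural information.
- Metadata: post content, post time, comment content, comment time, karma score, etc.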



