
SamW/HumanMOD

Hugging Face, updated 2023-07-08, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/SamW/HumanMOD
Resource description:
---
language:
- en
task_categories:
- text-classification
pretty_name: HumanMOD (AMCIS 2023)
size_categories:
- 10M<n<100M
---

We are excited to share the release of the HumanMOD dataset, unveiled in our [AMCIS 2023 paper](https://aisel.aisnet.org/amcis2023/sig_aiaa/sig_aiaa/3).

Wang, Kanlun; Fu, Zhe; Zhou, Lina; and Zhang, Dongsong, "How Does User Engagement Support Content Moderation? A Deep Learning-based Comparative Study" (2023). AMCIS 2023 Proceedings. 3. https://aisel.aisnet.org/amcis2023/sig_aiaa/sig_aiaa/3

Dataset Summary:

- The data collection was limited to public online communities to comply with the platform's privacy policy.
- We leveraged a [Pushshift Reddit API](https://reddit-api.readthedocs.io/en/latest/) to scrape posts from 40 subreddits daily across four different domains from Aug 24 to October 28, 2022, resulting in 104,674 posts.
- To enhance the ecological validity of the study findings, we used a [PRAW API](https://praw.readthedocs.io/en/stable/) to perform another round of data collection of the collected posts 2 months later to validate whether the post content was moderated or not.
- Thereafter, we used a snowballing approach to collect the corresponding comments on all the posts.
- The metadata includes post content, post time, comment content, comment time, karma score, etc.
- We set a threshold for the minimum number of comments to 2 and an upper bound for the number of direct comments to 15 to facilitate the extraction of graph-based structural information.
- The final dataset consists of 8,511 moderated posts and another 8,511 not moderated posts.
- All the posts were commented on, with a total of 148,344 comments.

Data Fields for HumanMOD_Posts dataset:

- Reddit_ID: the unique identifiers for Reddit posts, which serve as the foreign keys bridging to the HumanMOD_Comments dataset.
- Subreddits: the names of subreddits
- Titles: the titles of Reddit posts
- Body: post content, which is an extended description of a post
- Author: the authors of posts
- URLs: the web addresses of Reddit posts
- Labels: (0) -- The post is not moderated; (1) -- The post is moderated by moderators.

Data Fields for HumanMOD_Comments dataset:

- Parent ID: the unique identifiers for parent comments
- Comment ID: the unique identifiers for child comments, the ones that reply to the parent comments
- Comment Body: the content of child comments
- Score: the karma scores of child comments
- Author: the authors of child comments
- Post ID: the specific Reddit post identifiers to which the child comment should be associated, which serve as the foreign keys bridging to the HumanMOD_Posts dataset.
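
The foreign-key relationship between the two tables (Reddit_ID in HumanMOD_Posts, Post ID in HumanMOD_Comments) can be illustrated with a minimal sketch. The rows below are hypothetical, and the on-disk file format of the dataset is not stated here, so the example works on plain dictionaries keyed by the documented field names rather than on any particular loader.

```python
from collections import defaultdict

# Hypothetical rows that only mimic the documented field names;
# they are not taken from the real dataset.
posts = [
    {"Reddit_ID": "p1", "Subreddits": "example_sub", "Titles": "First post", "Labels": 1},
    {"Reddit_ID": "p2", "Subreddits": "example_sub", "Titles": "Second post", "Labels": 0},
]
comments = [
    {"Comment ID": "c1", "Parent ID": "p1", "Post ID": "p1", "Score": 3},
    {"Comment ID": "c2", "Parent ID": "c1", "Post ID": "p1", "Score": 1},
    {"Comment ID": "c3", "Parent ID": "p2", "Post ID": "p2", "Score": 5},
]


def attach_comments(posts, comments):
    """Return a copy of each post with its comments under a "comments" key."""
    by_post = defaultdict(list)
    for comment in comments:
        # "Post ID" is the foreign key pointing at "Reddit_ID".
        by_post[comment["Post ID"]].append(comment)
    return [{**post, "comments": by_post.get(post["Reddit_ID"], [])} for post in posts]


joined = attach_comments(posts, comments)
counts = {post["Reddit_ID"]: len(post["comments"]) for post in joined}
print(counts)
```

Grouping the comments once and then looking each post up keeps the join linear in the number of rows, which matters little for a toy example but helps with roughly 148,000 comments.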
Provider:
SamW
Summary of original information

Dataset overview

Basic information

  • Name: HumanMOD (AMCIS 2023)
  • Language: English
  • Task category: text classification
  • Dataset size: 10M<n<100M

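Data collection

  • Collection period: August 24 to October 28, 2022
  • Data source: posts were scraped daily through the Pushshift Reddit API from 40 subreddits across four domains, yielding 104,674 posts.
  • Validation: the collected posts were retrieved again through the PRAW API two months later to check whether their content had been moderated.
  • Comment collection: a snowballing approach was used to collect the corresponding comments on all posts.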

Dataset details

  • Post data: 8,511 moderated posts and 8,511 not moderated posts.
  • Comment data: 148,344 comments in total.
  • Data fields:
    • HumanMOD_Posts:
      • Reddit_ID
      • Subreddits
      • Titles
      • Body
      • Author
      • URLs
      • Labels (0: not moderated, 1: moderated)
    • HumanMOD_Comments:
      • Parent ID
      • Comment ID
      • Comment Body
      • Score
      • Author
      • Post ID
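
The Parent ID and Comment ID fields are enough to rebuild the reply structure of a thread. The following is a minimal sketch on hypothetical rows. It assumes that a top-level comment's Parent ID equals the post's Reddit_ID with no type prefix; if the real data keeps Reddit's `t1_` / `t3_` prefixes, they would need to be stripped first. The helper names are illustrative and not part of the dataset.

```python
from collections import defaultdict


def build_reply_tree(post_id, comments):
    """Map each parent ID to the IDs of the comments that reply to it."""
    children = defaultdict(list)
    for c in comments:
        if c["Post ID"] == post_id:
            children[c["Parent ID"]].append(c["Comment ID"])
    return dict(children)


def thread_depth(children, root):
    """Length of the longest reply chain below root (0 if nothing replies)."""
    deepest = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for child in children.get(node, []):
            stack.append((child, depth + 1))
    return deepest


# Hypothetical thread: two direct comments on post "p1", two replies to "c1".
sample_comments = [
    {"Comment ID": "c1", "Parent ID": "p1", "Post ID": "p1"},
    {"Comment ID": "c2", "Parent ID": "c1", "Post ID": "p1"},
    {"Comment ID": "c3", "Parent ID": "c1", "Post ID": "p1"},
    {"Comment ID": "c4", "Parent ID": "p1", "Post ID": "p1"},
]
tree = build_reply_tree("p1", sample_comments)
depth = thread_depth(tree, "p1")
print(tree, depth)
```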

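Data processing

  • Comment thresholds: a minimum of 2 comments and an upper bound of 15 direct comments per post, to facilitate the extraction of graph-based structural information.
  • Metadata: post content, post time, comment content, comment time, karma score, etc.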
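
The two comment thresholds can be expressed as a simple filter. This is a sketch under stated assumptions, not the authors' procedure: it assumes the minimum of 2 counts all comments on a post, that a "direct comment" is one whose Parent ID equals the post's Reddit_ID, and that IDs carry no type prefix. The function name and sample rows are hypothetical.

```python
def within_thresholds(post_id, comments, min_comments=2, max_direct=15):
    """Check the comment-count thresholds for one post.

    Assumes the minimum applies to all comments on the post and the
    upper bound applies only to comments replying directly to the post.
    """
    post_comments = [c for c in comments if c["Post ID"] == post_id]
    direct = [c for c in post_comments if c["Parent ID"] == post_id]
    return len(post_comments) >= min_comments and len(direct) <= max_direct


# Hypothetical rows: p1 has 2 comments (1 direct), p2 has 1, p3 has 16 direct.
rows = [
    {"Comment ID": "c1", "Parent ID": "p1", "Post ID": "p1"},
    {"Comment ID": "c2", "Parent ID": "c1", "Post ID": "p1"},
    {"Comment ID": "c3", "Parent ID": "p2", "Post ID": "p2"},
]
rows += [
    {"Comment ID": f"d{i}", "Parent ID": "p3", "Post ID": "p3"} for i in range(16)
]

kept = [pid for pid in ("p1", "p2", "p3") if within_thresholds(pid, rows)]
print(kept)
```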