
georeactor/reddit_one_ups_2014

Hugging Face | Updated 2023-03-28 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/georeactor/reddit_one_ups_2014
Resource description:
---
task_categories:
- text-classification
tags:
- reddit
- not-for-all-eyes
- not-for-all-audiences
language: en
---

# Dataset Card for reddit_one_ups_2014

## Dataset Description

- **Homepage:** https://github.com/Georeactor/reddit-one-ups

### Dataset Summary

Reddit 'one-ups' or 'clapbacks' - replies which scored higher than the original comments.

This task makes one-ups easier by focusing on a set of common, often meme-like replies (e.g. 'yes', 'nope', '(͡°͜ʖ͡°)').

For commentary on predictions with a previous version of the dataset, see https://blog.goodaudience.com/can-deepclapback-learn-when-to-lol-e4a2092a8f2c

For unique / non-meme seq2seq version of this dataset, see https://huggingface.co/datasets/georeactor/reddit_one_ups_seq2seq_2014

Replies were selected from PushShift's archive of posts from 2014.

### Supported Tasks

Text classification task: finding the common reply (out of ~37) to match the parent comment text.

Text prediction task: estimating the vote score, or parent:reply ratio, of a meme response, as a measure of relevancy/cleverness of reply.

### Languages

Primarily English - includes some emoticons such as ┬─┬ノ(ಠ_ಠノ)

## Dataset Structure

### Data Instances

29,375 rows

### Data Fields

- id: the Reddit alphanumeric ID for the reply
- body: the content of the original reply
- score: the net vote score of the original reply
- parent_id: the Reddit alphanumeric ID for the parent
- author: the Reddit username of the reply
- subreddit: the Reddit community where the discussion occurred
- parent_score: the net vote score of the parent comment
- cleantext: the simplified reply (one of 37 classes)
- tstamp: the timestamp of the reply
- parent_body: the content of the original parent

## Dataset Creation

### Source Data

Reddit comments collected through PushShift.io archives for 2014.

#### Initial Data Collection and Normalization

- Removed deleted or empty comments.
- Selected only replies which scored 1.5x higher than a parent comment, where both have a positive score.
- Found the top/repeating phrases common to these one-ups/clapback comments.
- Selected only replies which had one of these top/repeating phrases.
- Made rows in PostgreSQL and output as CSV.

## Considerations for Using the Data

Comments and responses in the Reddit archives and output datasets all include NSFW and otherwise toxic language and links!

- You can use the subreddit and score columns to filter content.
- Imbalanced dataset: replies 'yes' and 'no' are more common than others.
- Overlap of labels: replies such as 'yes', 'yep', and 'yup' serve similar purposes; in other cases 'no' vs. 'nope' may be interesting.
- Timestamps: the given timestamp may help identify trends in meme replies
- Usernames: a username was included to identify the 'username checks out' meme, but this was not common enough in 2014, and the included username is from the reply.

Reddit comments are properties of Reddit and comment owners using their Terms of Service.
Provider:
georeactor
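Summary of Original Information

Dataset Overview: reddit_one_ups_2014

Dataset Description

Dataset Summary

  • The dataset contains Reddit "one-ups" or "clapbacks": replies that scored higher than the original comment.
  • It focuses on a set of common, meme-like replies such as "yes", "nope", and "(͡°͜ʖ͡°)".

Supported Tasks

  • Text classification: find the common reply (out of about 37) that matches the parent comment.
  • Text prediction: estimate the vote score, or the parent:reply ratio, of a meme reply as a measure of the reply's relevance or cleverness.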
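
As a rough illustration of how the two tasks could be framed, the sketch below builds (text, label) pairs for classification and a score-ratio target for prediction. The helper names and the toy rows are invented for this example and are not taken from the dataset. The direction of the ratio (reply score divided by parent score) is also an assumption, since the card only says "parent:reply ratio".

```python
from typing import Dict, List, Tuple


def classification_pairs(rows: List[Dict]) -> List[Tuple[str, str]]:
    """Pair each parent comment with the common-reply class it received."""
    return [(r["parent_body"], r["cleantext"]) for r in rows]


def ratio_targets(rows: List[Dict]) -> List[float]:
    """Reply score divided by parent score (assumed direction of the ratio)."""
    return [r["score"] / r["parent_score"] for r in rows]


# Toy rows made up for illustration; they are NOT real dataset content.
toy_rows = [
    {"parent_body": "Is this real?", "cleantext": "yes",
     "score": 30, "parent_score": 10},
    {"parent_body": "Pineapple belongs on pizza.", "cleantext": "nope",
     "score": 9, "parent_score": 6},
]

pairs = classification_pairs(toy_rows)
targets = ratio_targets(toy_rows)
```

Because every selected reply outscored its parent by construction, a ratio defined this way should always be greater than 1 on the real data.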

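Languages

  • Primarily English, with some emoticons such as ┬─┬ノ(ಠ_ಠノ).

Dataset Structure

Data Instances

  • 29,375 rows in total.

Data Fields

  • id: the Reddit alphanumeric ID of the reply.
  • body: the content of the original reply.
  • score: the net vote score of the original reply.
  • parent_id: the Reddit alphanumeric ID of the parent comment.
  • author: the Reddit username of the reply's author.
  • subreddit: the Reddit community where the discussion took place.
  • parent_score: the net vote score of the parent comment.
  • cleantext: the simplified reply (one of 37 classes).
  • tstamp: the timestamp of the reply.
  • parent_body: the content of the original parent comment.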
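
The field list above maps naturally onto a typed record. The following is a minimal sketch that assumes the CSV header uses exactly these field names and that the two score columns parse as integers. The class name `OneUpRow`, the helper `parse_rows`, and the sample line (including its timestamp format) are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class OneUpRow:
    id: str            # Reddit ID of the reply
    body: str          # original reply text
    score: int         # net vote score of the reply
    parent_id: str     # Reddit ID of the parent comment
    author: str        # username of the reply's author
    subreddit: str     # community where the exchange happened
    parent_score: int  # net vote score of the parent comment
    cleantext: str     # simplified reply, one of the 37 classes
    tstamp: str        # kept as a raw string; the card does not give the format
    parent_body: str   # parent comment text


def parse_rows(csv_text: str) -> List[OneUpRow]:
    """Parse CSV text whose header matches the documented field names."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rec["score"] = int(rec["score"])
        rec["parent_score"] = int(rec["parent_score"])
        rows.append(OneUpRow(**rec))
    return rows


# Made-up sample line for illustration only; not real dataset content.
SAMPLE_CSV = (
    "id,body,score,parent_id,author,subreddit,parent_score,"
    "cleantext,tstamp,parent_body\n"
    "c0000a1,Yes.,30,c0000a0,example_user,AskReddit,10,"
    "yes,2014-01-01 00:00:00,Is this real?\n"
)

sample_rows = parse_rows(SAMPLE_CSV)
```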

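Dataset Creation

Source Data

  • The data comes from the PushShift.io archive of Reddit comments for 2014.

Initial Data Collection and Normalization

  • Removed deleted or empty comments.
  • Kept only replies that scored at least 1.5x the parent comment's score, where both scores are positive.
  • Identified the top/repeating phrases common to these "one-up"/"clapback" comments.
  • Kept only replies that contain one of these top/repeating phrases.
  • Created rows in PostgreSQL and exported them as CSV.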
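
The selection steps above can be sketched in plain Python. This is not the author's pipeline, which ran in PostgreSQL. Several choices here are assumptions made for illustration: "1.5x higher" is read as "at least 1.5 times the parent score", deleted comments are assumed to appear as `[deleted]` or `[removed]`, phrases are matched by exact equality after lowercasing and trimming, and the function name `select_one_ups` is hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumption: how deleted, removed, or empty comments appear in the archive.
DELETED_MARKERS = {"", "[deleted]", "[removed]"}


def normalize(text: str) -> str:
    """Crude normalization (assumed): trim whitespace and lowercase."""
    return text.strip().lower()


def select_one_ups(pairs: List[Dict], min_ratio: float = 1.5,
                   top_k: int = 37) -> List[Dict]:
    """Keep high-scoring replies, then keep only the most common phrases."""
    kept = [
        p for p in pairs
        if normalize(p["body"]) not in DELETED_MARKERS
        and normalize(p["parent_body"]) not in DELETED_MARKERS
        and p["score"] > 0 and p["parent_score"] > 0
        and p["score"] >= min_ratio * p["parent_score"]
    ]
    counts = Counter(normalize(p["body"]) for p in kept)
    common = {phrase for phrase, _ in counts.most_common(top_k)}
    return [
        dict(p, cleantext=normalize(p["body"]))
        for p in kept if normalize(p["body"]) in common
    ]


# Toy reply/parent pairs made up for illustration; not real dataset content.
toy_pairs = [
    {"body": "Yes", "score": 30, "parent_body": "a", "parent_score": 10},
    {"body": "yes", "score": 15, "parent_body": "b", "parent_score": 10},
    {"body": "nope", "score": 20, "parent_body": "c", "parent_score": 5},
    {"body": "[deleted]", "score": 50, "parent_body": "d", "parent_score": 1},
    {"body": "yes", "score": 5, "parent_body": "e", "parent_score": 10},
    {"body": "yes", "score": 3, "parent_body": "f", "parent_score": -1},
]

selected = select_one_ups(toy_pairs, top_k=1)
```

With `top_k=1`, only the two qualifying "yes" replies survive in this toy input. The default of 37 mirrors the number of classes reported for the dataset.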

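Considerations for Using the Data

  • Comments and replies in the dataset include NSFW and toxic language and links.
  • The subreddit and score fields can be used to filter content.
  • The dataset is imbalanced: replies such as "yes" and "no" are more common than the others.
  • Label overlap: "yes", "yep", and "yup" serve similar purposes, while the contrast between "no" and "nope" may be interesting.
  • Timestamps: the provided timestamp may help identify trends in meme replies.
  • Usernames: a username is included to identify the "username checks out" meme, but this was not common enough in 2014.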
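
The filtering and label-overlap points above might be handled along the following lines. This is a sketch only: the blocklist entry `example_blocked_sub`, the score threshold, the `MERGE` mapping, and the helper names are all hypothetical, and whether to merge near-synonymous labels at all depends on the experiment.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Optional


def filter_rows(rows: List[Dict],
                blocked_subreddits: FrozenSet[str] = frozenset(),
                min_parent_score: int = 1) -> List[Dict]:
    """Drop rows from blocked subreddits (lowercase names) or low-score parents."""
    return [
        r for r in rows
        if r["subreddit"].lower() not in blocked_subreddits
        and r["parent_score"] >= min_parent_score
    ]


# Hypothetical merge of near-synonymous labels; 'no' vs. 'nope' is left apart.
MERGE = {"yep": "yes", "yup": "yes"}


def label_counts(rows: List[Dict],
                 merge: Optional[Dict[str, str]] = None) -> Counter:
    """Count reply classes, optionally collapsing labels through `merge`."""
    merge = merge or {}
    return Counter(merge.get(r["cleantext"], r["cleantext"]) for r in rows)


# Toy rows made up for illustration; not real dataset content.
toy_rows = [
    {"subreddit": "AskReddit", "parent_score": 10, "cleantext": "yes"},
    {"subreddit": "AskReddit", "parent_score": 4, "cleantext": "yup"},
    {"subreddit": "example_blocked_sub", "parent_score": 7, "cleantext": "no"},
    {"subreddit": "funny", "parent_score": 2, "cleantext": "nope"},
]

safe_rows = filter_rows(toy_rows, frozenset({"example_blocked_sub"}))
raw_counts = label_counts(safe_rows)
merged_counts = label_counts(safe_rows, MERGE)
```

Inspecting the counts before and after merging is a quick way to see how much of the class imbalance comes from overlapping labels.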