huanggab/reddit_haiku

Name: huanggab/reddit_haiku
Creator: huanggab
Published: 2022-11-18 20:02:29
License: 暂无描述

Hugging Face2022-11-18 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/huanggab/reddit_haiku

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - found language: - en language_creators: - found license: - unknown multilinguality: - monolingual pretty_name: English haiku dataset scraped from Reddit's /r/haiku with topics extracted using KeyBERT size_categories: - 10K<n<100K source_datasets: - extended|other tags: - haiku - poem - poetry - reddit - keybert - generation task_categories: - text-generation task_ids: - language-modeling --- # Dataset Card for "Reddit Haiku" This dataset contains haikus from the subreddit [/r/haiku](https://www.reddit.com/r/haiku/) scraped and filtered between October 19th and 10th 2022, combined with a [previous dump](https://zissou.infosci.cornell.edu/convokit/datasets/subreddit-corpus/corpus-zipped/hackintosh_ja~-~hamsters/) of that same subreddit packaged by [ConvoKit](https://convokit.cornell.edu/documentation/subreddit.html) as part of the Subreddit Corpus, which is itself a subset of [pushshift.io](https://pushshift.io/)'s big dump. A main motivation for this dataset was to collect an alternative haiku dataset for evaluation, in particular for evaluating Fabian Mueller's Deep Haiku [model](fabianmmueller/deep-haiku-gpt-j-6b-8bit) which was trained on the Haiku datasets of [hjhalani30](https://www.kaggle.com/datasets/hjhalani30/haiku-dataset) and [bfbarry](https://www.kaggle.com/datasets/bfbarry/haiku-dataset), which are also available on [huggingface hub](https://huggingface.co/datasets/statworx/haiku). ## Fields The fields are post id (`id`), the content of the haiku (`processed_title`), upvotes (`ups`), and topic keywords (`keywords`). Topic keywords for each haiku have been extracted with the [KeyBERT library](https://maartengr.github.io/KeyBERT/guides/quickstart.html) and truncated to top-5 keywords. ## Usage This dataset is intended for evaluation, hence there is only one split which is `test`. ```python from datasets import load_dataset d=load_dataset('huanggab/reddit_haiku', data_files='test':'merged_with_keywords.csv'}) # use data_files or it will result in error >>> print(d['train'][0]) #{'Unnamed: 0': 0, 'id': '1020ac', 'processed_title': "There's nothing inside/There is nothing outside me/I search on in hope.", 'ups': 5, 'keywords': "[('inside', 0.5268), ('outside', 0.3751), ('search', 0.3367), ('hope', 0.272)]"} ``` There is code for scraping and processing in `processing_code`, and a subset of the data with more fields such as author Karma, downvotes and posting time at `processing_code/reddit-2022-10-20-dump.csv`.

提供机构：

huanggab

原始信息汇总

数据集概述

基本信息

名称: Reddit Haiku
语言: 英语
多语言性: 单语种
许可证: 未知
大小: 10K<n<100K
来源: 扩展自其他数据集
标签: haiku, poem, poetry, reddit, keybert, generation
任务类别: 文本生成
任务ID: 语言建模

数据集内容

数据来源: 从Reddit的/r/haiku子论坛抓取，并使用KeyBERT提取主题
数据时间范围: 2022年10月19日至10日
数据合并: 与之前的/r/haiku数据集合并
字段:
- id: 帖子ID
- processed_title: 俳句内容
- ups: 点赞数
- keywords: 主题关键词（使用KeyBERT提取，截断至前5个关键词）

使用目的

主要用途: 评估，特别是用于评估Fabian Mueller的Deep Haiku模型
数据集分割: 仅包含测试集

5,000+

优质数据集

54 个

任务类型

进入经典数据集