
RyokoAI/Honeyfeed3600

Source: Hugging Face · Updated 2023-04-05 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/RyokoAI/Honeyfeed3600
Resource description:
---
license: apache-2.0
language:
- en
tags:
- novel
- training
- story
task_categories:
- text-classification
- text-generation
pretty_name: Honeyfeed3600
size_categories:
- 1K<n<10K
---

# Dataset Card for Honeyfeed3600

*The BigKnow2022 dataset and its subsets are not yet complete. Not all information here may be accurate or accessible.*

## Dataset Description

- **Homepage:** (TODO)
- **Repository:** <https://github.com/RyokoAI/BigKnow2022>
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** Ronsor/undeleted <ronsor@ronsor.com>

### Dataset Summary

Honeyfeed3600 is a dataset consisting of text from over 38,000 chapters across approximately 3,600 series posted on the English-language web novel site [Honeyfeed](https://www.honeyfeed.fm).

### Supported Tasks and Leaderboards

This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.

* text-classification
* text-generation

### Languages

* English

## Dataset Structure

### Data Instances

```json
{
  "text": "Dark, black, nothingness. There are so many ways to describe that hole, but nothing would get me down there...",
  "meta": {
    "subset": "honeyfeed",
    "themes": [],
    "my_themes": [],
    "prompt": "",
    "author": "Lucianael",
    "novel": "10009",
    "id": "55686",
    "title": "13 Steps - 13 Steps",
    "likes": 4,
    "views": 21,
    "q": 0.5999999999999999
  }
}
```

### Data Fields

* `text`: the actual chapter text
* `meta`: novel and chapter metadata
  * `subset`: dataset tag: `honeyfeed`
  * `lang`: dataset language: `en` (English)
  * `themes`: array of novel themes
  * `my_themes`: array of additional novel themes
  * `prompt`: writing prompt
  * `author`: author name
  * `novel`: novel ID
  * `id`: chapter ID
  * `title`: novel and chapter title in the form `<chapter title> - <novel title>`
  * `likes`: novel like count
  * `views`: novel view count
  * `q`: q-score (quality score)

#### Q-Score Distribution

```
0.00: 499
0.10: 420
0.20: 2562
0.30: 0
0.40: 0
0.50: 13344
0.60: 9021
0.70: 5997
0.80: 4217
0.90: 1931
1.00: 801
```

### Data Splits

No splitting of the data was performed.

## Dataset Creation

### Curation Rationale

TODO

### Source Data

#### Initial Data Collection and Normalization

TODO

#### Who are the source language producers?

The authors of each novel.

### Annotations

#### Annotation process

Chapter and novel titles were scraped alongside chapter text.

#### Who are the annotators?

No human annotators.

### Personal and Sensitive Information

The dataset contains only works of fiction, and we do not believe it contains any PII.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is intended to be useful for anyone who wishes to train a model to generate "more entertaining" content. It may also be useful for other languages depending on your language model.

### Discussion of Biases

This dataset is composed of fictional works by various authors. Because of this fact, the contents of this dataset will reflect the biases of those authors. Beware of stereotypes.

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

Ronsor Labs

### Licensing Information

Apache 2.0, for all parts of which Ronsor Labs or the Ryoko AI Production Committee may be considered authors. All other material is distributed under fair use principles.

### Citation Information

```
@misc{ryokoai2023-bigknow2022,
  title = {BigKnow2022: Bringing Language Models Up to Speed},
  author = {Ronsor},
  year = {2023},
  howpublished = {\url{https://github.com/RyokoAI/BigKnow2022}},
}
```

### Contributions

Thanks to @ronsor (GH) for gathering this dataset.
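
As an illustration of the record layout described in the card, the following minimal sketch parses the sample instance, splits its `title` field, and applies a q-score cutoff. The helper names `split_title` and `keep`, the 0.5 default threshold, and the choice to split at the first ` - ` are assumptions made for this example; the card does not say how titles that themselves contain ` - ` should be handled, nor does it recommend any threshold.

```python
import json

# Sample record copied from the "Data Instances" section of the card.
RAW_RECORD = """
{
  "text": "Dark, black, nothingness. There are so many ways to describe that hole, but nothing would get me down there...",
  "meta": {
    "subset": "honeyfeed",
    "themes": [],
    "my_themes": [],
    "prompt": "",
    "author": "Lucianael",
    "novel": "10009",
    "id": "55686",
    "title": "13 Steps - 13 Steps",
    "likes": 4,
    "views": 21,
    "q": 0.5999999999999999
  }
}
"""


def split_title(title):
    """Split '<chapter title> - <novel title>' at the first ' - '.

    Assumption: the chapter title contains no ' - ' of its own.
    Returns (title, "") when no separator is present.
    """
    chapter, sep, novel = title.partition(" - ")
    if not sep:
        return title, ""
    return chapter, novel


def keep(record, min_q=0.5):
    """Return True if the record's q-score meets a (hypothetical) threshold.

    The score is rounded to two decimals first, because values such as
    0.5999999999999999 would otherwise fall just below a 0.6 cutoff.
    """
    return round(record["meta"]["q"], 2) >= min_q


record = json.loads(RAW_RECORD)
chapter_title, novel_title = split_title(record["meta"]["title"])
print(chapter_title, "|", novel_title, "|", keep(record))
```

The rounding step matters in practice: the sample's `q` is stored as `0.5999999999999999`, which compares as less than `0.6` unless it is rounded first.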
Provider:
RyokoAI
Summary of original information

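Dataset overview

Dataset name

  • Honeyfeed3600

Dataset description

  • Source: over 38,000 chapters from approximately 3,600 series on the English-language web novel site Honeyfeed.
  • Primary use: unsupervised training of text generation models; it may also be useful for other text-related tasks.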

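Dataset structure

  • Data instances: each instance contains text and metadata.
    • Text: the content of the chapter.
    • Metadata: subset tag, language, themes, author, novel ID, chapter ID, title, like count, view count, and quality score (q-score).
  • Data fields
    • text: the chapter text.
    • meta: novel and chapter metadata, including:
      • subset: dataset tag, always honeyfeed.
      • lang: language, always en (English).
      • themes: list of novel themes.
      • my_themes: list of additional novel themes.
      • prompt: writing prompt.
      • author: author name.
      • novel: novel ID.
      • id: chapter ID.
      • title: chapter title and novel title combined, in the form <chapter title> - <novel title>.
      • likes: novel like count.
      • views: novel view count.
      • q: quality score (q-score).
  • Q-score distribution: the number of instances at each q-score value, listed in the dataset card.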
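
The q-score counts published in the dataset card can be tallied to see how much of the data a given cutoff would retain. The sketch below is illustrative only: the counts are copied from the card, while the function name `share_at_or_above` and the 0.5 cutoff are assumptions chosen for the example, not a recommendation from the dataset authors.

```python
# Counts copied from the "Q-Score Distribution" block of the dataset card.
Q_COUNTS = {
    0.0: 499,
    0.1: 420,
    0.2: 2562,
    0.3: 0,
    0.4: 0,
    0.5: 13344,
    0.6: 9021,
    0.7: 5997,
    0.8: 4217,
    0.9: 1931,
    1.0: 801,
}


def share_at_or_above(counts, threshold):
    """Return (kept, total, kept / total) for buckets with q >= threshold."""
    total = sum(counts.values())
    kept = sum(n for q, n in counts.items() if q >= threshold)
    return kept, total, kept / total


kept, total, share = share_at_or_above(Q_COUNTS, 0.5)
print(f"{kept} of {total} chapters ({share:.1%}) have q >= 0.5")
# prints "35311 of 38792 chapters (91.0%) have q >= 0.5"
```

The total of 38,792 is consistent with the "over 38,000 chapters" stated in the dataset description.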

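Dataset language

  • English

License

  • Apache 2.0

Contact information

Dataset creation

  • Source data producers: the authors of each novel.
  • Annotation process: chapter and novel titles were scraped automatically from the site.
  • Sensitive information: the dataset contains only works of fiction and is not believed to contain personal or sensitive information.

Considerations for use

  • Social impact: intended to help train models that generate "more entertaining" content.
  • Discussion of biases: the dataset reflects the biases of its authors; beware of stereotypes.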