RyokoAI/Honeyfeed3600
Hugging Face. Updated 2023-04-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RyokoAI/Honeyfeed3600
Resource description:
---
license: apache-2.0
language:
- en
tags:
- novel
- training
- story
task_categories:
- text-classification
- text-generation
pretty_name: Honeyfeed3600
size_categories:
- 1K<n<10K
---
# Dataset Card for Honeyfeed3600
*The BigKnow2022 dataset and its subsets are not yet complete. Not all information here may be accurate or accessible.*
## Dataset Description
- **Homepage:** (TODO)
- **Repository:** <https://github.com/RyokoAI/BigKnow2022>
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** Ronsor/undeleted <ronsor@ronsor.com>
### Dataset Summary
Honeyfeed3600 is a dataset consisting of text from over 38,000 chapters across approximately 3,600 series posted on the
English-language web novel site [Honeyfeed](https://www.honeyfeed.fm).
### Supported Tasks and Leaderboards
This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.
* text-classification
* text-generation
### Languages
* English
## Dataset Structure
### Data Instances
```json
{
  "text": "Dark, black, nothingness. There are so many ways to describe that hole, but nothing would get me down there...",
  "meta": {
    "subset": "honeyfeed",
    "themes": [],
    "my_themes": [],
    "prompt": "",
    "author": "Lucianael",
    "novel": "10009",
    "id": "55686",
    "title": "13 Steps - 13 Steps",
    "likes": 4,
    "views": 21,
    "q": 0.5999999999999999
  }
}
```
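A record shaped like the one above can be read with Python's standard `json` module. The sketch below assumes, hypothetically, that the data is stored as one JSON object per line (JSON Lines); this card does not state the on-disk format. `parse_record` is an illustrative helper name, not part of any dataset tooling.

```python
import json


def parse_record(line: str) -> tuple[str, dict]:
    """Parse one JSON-encoded record into (chapter text, metadata dict).

    Assumes one JSON object per line; this layout is an assumption, not
    something the dataset card confirms.
    """
    record = json.loads(line)
    return record["text"], record["meta"]


# A shortened record shaped like the example in this card.
sample_line = json.dumps({
    "text": "Dark, black, nothingness.",
    "meta": {"subset": "honeyfeed", "author": "Lucianael", "novel": "10009",
             "id": "55686", "title": "13 Steps - 13 Steps",
             "likes": 4, "views": 21, "q": 0.5999999999999999},
})

text, meta = parse_record(sample_line)
print(meta["author"], meta["novel"], meta["id"])
```

Note that in the example record the `novel` and `id` values are strings, not integers, so comparisons and lookups should use string keys.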
### Data Fields
* `text`: the actual chapter text
* `meta`: novel and chapter metadata
  * `subset`: dataset tag: `honeyfeed`
  * `lang`: dataset language: `en` (English)
  * `themes`: array of novel themes
  * `my_themes`: array of additional novel themes
  * `prompt`: writing prompt
  * `author`: author name
  * `novel`: novel ID
  * `id`: chapter ID
  * `title`: novel and chapter title in the form `<chapter title> - <novel title>`
  * `likes`: novel like count
  * `views`: novel view count
  * `q`: q-score (quality score)
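Because `title` combines two values in the form `<chapter title> - <novel title>`, it may be convenient to separate them. The following is a minimal sketch; `split_title` is a hypothetical helper, and it assumes the novel title never contains the ` - ` separator itself, which the card does not guarantee.

```python
def split_title(title: str) -> tuple[str, str]:
    """Split '<chapter title> - <novel title>' into (chapter, novel).

    Splits on the LAST ' - ' separator, which assumes the novel title does not
    itself contain ' - '. If no separator is present, the whole string is
    returned as the chapter title and the novel title is empty.
    """
    chapter, sep, novel = title.rpartition(" - ")
    if not sep:
        return title, ""
    return chapter, novel


chapter_title, novel_title = split_title("13 Steps - 13 Steps")
print(chapter_title, "|", novel_title)
```

Splitting on the last separator is a design choice: chapter titles such as `Prologue - Part 1` stay intact, at the cost of mis-splitting any novel whose own title contains ` - `. Grouping by the `novel` ID is a more reliable way to collect chapters of the same series.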
#### Q-Score Distribution
```
0.00: 499
0.10: 420
0.20: 2562
0.30: 0
0.40: 0
0.50: 13344
0.60: 9021
0.70: 5997
0.80: 4217
0.90: 1931
1.00: 801
```
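The distribution above can be used to estimate how many chapters survive a q-score filter. The sketch below copies the counts from this card; the rounding rule in `q_bucket` and the `0.5` default threshold are assumptions for illustration, since the card does not say how raw scores were binned or which cutoff is appropriate.

```python
# Counts copied from the Q-score distribution listed in this card.
Q_DISTRIBUTION = {
    0.0: 499, 0.1: 420, 0.2: 2562, 0.3: 0, 0.4: 0, 0.5: 13344,
    0.6: 9021, 0.7: 5997, 0.8: 4217, 0.9: 1931, 1.0: 801,
}


def q_bucket(q: float) -> float:
    """Map a raw q-score to a 0.1-wide bucket by rounding (an assumption)."""
    return round(q, 1)


def keep_record(meta: dict, min_q: float = 0.5) -> bool:
    """Return True if the record's bucketed q-score is at least min_q."""
    return q_bucket(meta["q"]) >= min_q


total = sum(Q_DISTRIBUTION.values())
kept = sum(n for q, n in Q_DISTRIBUTION.items() if q >= 0.5)
print(total, kept)
```

The listed counts sum to 38,792 chapters, consistent with the "over 38,000 chapters" figure in the summary; 35,311 of them fall in the `0.50` bucket or higher. Bucketing before comparing matters because raw values such as `0.5999999999999999` carry floating-point noise.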
### Data Splits
No splitting of the data was performed.
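Since no splits are provided, users who need a held-out set have to create one. One possible approach, sketched below, assigns each novel to a split by hashing its `novel` ID so that all chapters of a series land on the same side. The function name `assign_split`, the split names, and the 10% validation fraction are all illustrative assumptions, not part of the dataset.

```python
import hashlib


def assign_split(novel_id: str, validation_fraction: float = 0.1) -> str:
    """Deterministically assign a novel to 'train' or 'validation'.

    Hashing the novel ID (rather than the chapter ID) keeps every chapter of a
    series in the same split. The split names and the 10% default are
    illustrative choices, not part of the dataset.
    """
    digest = hashlib.sha256(novel_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to between 0 and 1.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if fraction < validation_fraction else "train"


print(assign_split("10009"))
```

A hash-based assignment is stable across runs and machines, unlike a seeded shuffle that depends on record order.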
## Dataset Creation
### Curation Rationale
TODO
### Source Data
#### Initial Data Collection and Normalization
TODO
#### Who are the source language producers?
The authors of each novel.
### Annotations
#### Annotation process
Chapter and novel titles were scraped alongside chapter text.
#### Who are the annotators?
No human annotators.
### Personal and Sensitive Information
The dataset contains only works of fiction, and we do not believe it contains any PII.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset is intended to be useful for anyone who wishes to train a model to generate "more entertaining" content.
It may also be useful for other languages depending on your language model.
### Discussion of Biases
This dataset is composed of fictional works by various authors. Because of this fact, the contents of this dataset will reflect
the biases of those authors. Beware of stereotypes.
### Other Known Limitations
N/A
## Additional Information
### Dataset Curators
Ronsor Labs
### Licensing Information
Apache 2.0, for all parts of which Ronsor Labs or the Ryoko AI Production Committee may be considered authors. All other material is
distributed under fair use principles.
### Citation Information
```
@misc{ryokoai2023-bigknow2022,
title = {BigKnow2022: Bringing Language Models Up to Speed},
author = {Ronsor},
year = {2023},
howpublished = {\url{https://github.com/RyokoAI/BigKnow2022}},
}
```
### Contributions
Thanks to @ronsor (GH) for gathering this dataset.
Provider:
RyokoAI
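Summary of Original Information
Dataset overview
Dataset name
- Honeyfeed3600
Dataset description
- Source: more than 38,000 chapters across approximately 3,600 series from the English-language web novel site Honeyfeed.
- Primary use: unsupervised training of text generation models; it may also be useful for other text-related tasks.
Dataset structure
- Data instances: each instance contains text content and metadata.
- Text: the content of the chapter.
- Metadata: subset tag, language, themes, author, novel ID, chapter ID, title, like count, view count, and quality score (q-score).
- Data fields:
  - `text`: chapter text content.
  - `meta`: novel and chapter metadata, including:
    - `subset`: dataset tag, always `honeyfeed`.
    - `lang`: language, always `en` (English).
    - `themes`: list of novel themes.
    - `my_themes`: list of additional novel themes.
    - `prompt`: writing prompt.
    - `author`: author name.
    - `novel`: novel ID.
    - `id`: chapter ID.
    - `title`: combination of chapter title and novel title.
    - `likes`: like count.
    - `views`: view count.
    - `q`: quality score (q-score).
- Q-score distribution: the number of instances at each q-score value is listed in detail.
Dataset language
- English
License
- Apache 2.0
Contact information
- Contact: Ronsor/undeleted ronsor@ronsor.com
Dataset creation
- Source language producers: the authors of each novel.
- Annotation process: chapter and novel titles were scraped automatically from the website.
- Sensitive information: the dataset contains only works of fiction and is not believed to contain personal or sensitive information.
Considerations for use
- Social impact: intended to help train models that generate "more entertaining" content.
- Discussion of biases: the dataset's contents reflect the biases of the authors; beware of stereotypes.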



