RyokoExtra/FallingThroughTheSkies-592k
Hugging Face, updated 2023-08-09, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/RyokoExtra/FallingThroughTheSkies-592k
Resource description:
---
task_categories:
- text-classification
- text-generation
- text2text-generation
- text-retrieval
tags:
- not-for-all-audiences
pretty_name: Falling through the Skies
size_categories:
- 100K<n<1M
language:
- en
license:
- mit
multilinguality:
- monolingual
---
# Dataset Card for Falling through the Skies
## Dataset Description
- **Homepage:** Nowhere else except here.
- **Repository:** Here!
- **Paper:** Nil
- **Leaderboard:** Nil
- **Point of Contact:** KaraKaraWitch
### Dataset Summary
*Falling through the Skies* is an unfiltered dump of **592k stories** from Literotica.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The dataset is mainly targeted at English. However, some entries may not be in English.
## Dataset Structure
### Data Instances
The 7z file contains a folder holding each chapter/short story. Each file is formatted as JSON, and the meaning of most values should be intuitive.
The important keys are:
`content`: Contains the story/chapter content. If the story is paged, the pages have been collapsed into one string and separated with 2 newline characters: `<Page 1>\n\n<Page 2>`
`meta`: Consider using this field instead of the related fields in the root. For example, prefer `meta > view_count` over the root-level `"view_count"`.
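As a rough illustration of the two keys above, the sketch below parses one record, looks a value up in `meta` before falling back to the root, and splits `content` on the double-newline separator. The helper names (`load_record`, `get_field`, `split_pages`) are hypothetical, and it assumes `meta` is a JSON object. Note that splitting on `\n\n` will also split on any blank-line paragraph breaks inside a page, so the pieces are not guaranteed to match the original pages.

```python
import json
from typing import Any


def load_record(text: str) -> dict[str, Any]:
    """Parse the JSON text of a single chapter/short-story file."""
    return json.loads(text)


def get_field(record: dict[str, Any], key: str, default: Any = None) -> Any:
    """Return record["meta"][key] if present, else fall back to the root key."""
    meta = record.get("meta") or {}  # assumption: meta is a JSON object
    if key in meta:
        return meta[key]
    return record.get(key, default)


def split_pages(content: str) -> list[str]:
    """Split collapsed content on the two-newline separator, dropping empty pieces."""
    return [chunk for chunk in content.split("\n\n") if chunk.strip()]
```

For example, `get_field(record, "view_count")` returns the `meta` value when both the root and `meta` carry that key.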
### Data Fields
There are too many to list, but there are a few key fields that you might be interested in (they are located in the `meta` key):
- `tag`: Author tags for the work.
- `views`: Number of views at the time of scrape
- `words`: Number of words for the work
- `category`: Category where the story/chapter is slotted in
- `writers_pick`: If the work has been selected, it will be set to `true`
- `comment_count`: The number of comments for this story/chapter. Note, however, that the content of the comments is not captured.
- `desc`: The short blurb that introduces the story/chapter.
- `author`: Author for the story/chapter.
The following keys are located in the root:
- `url`: The url slug. Example: `https://<Site Redacted>/s/<url>`.
- `title`: The title of the story/chapter.
- `content`: The contents for the story/chapter.
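The fields listed above lend themselves to simple filtering once the archive has been extracted. The following is a minimal sketch, not the curator's tooling: it assumes the extracted files end in `.json`, that `words` can be converted to an integer, and that `writers_pick` is a JSON boolean. The function names and the `min_words` and `writers_pick_only` parameters are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def iter_records(folder: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per JSON file found under the extracted folder."""
    # Assumption: files use a .json extension; adjust the pattern if they do not.
    for path in sorted(folder.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def select(
    records: Iterable[dict[str, Any]],
    min_words: int = 0,
    writers_pick_only: bool = False,
) -> Iterator[dict[str, Any]]:
    """Keep records whose meta fields pass the word-count and pick filters."""
    for record in records:
        meta = record.get("meta") or {}
        try:
            words = int(meta.get("words") or 0)
        except (TypeError, ValueError):
            words = 0  # treat an unparsable word count as zero
        if words < min_words:
            continue
        if writers_pick_only and meta.get("writers_pick") is not True:
            continue
        yield record
```

A call such as `select(iter_records(Path("extracted")), min_words=1000)` would stream only the longer works, where `extracted` stands in for wherever the 7z file was unpacked.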
### Data Splits
There are no data splits, as the whole dataset is small enough to be stored within a single filesystem.
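Since the dataset ships without splits, anyone who needs a held-out set has to create one. One common approach, sketched here as an assumption rather than anything the dataset prescribes, is to hash the root-level `url` of each record so that the same work always lands in the same split. The function name, the split names, and the default fraction are hypothetical.

```python
import hashlib


def assign_split(url: str, validation_fraction: float = 0.05) -> str:
    """Deterministically map a record's url slug to "train" or "validation"."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "validation" if bucket < validation_fraction * 10_000 else "train"
```

Because the assignment depends only on the slug, re-running it gives the same partition without having to store an index file.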
## Dataset Creation
### Curation Rationale
The Falling through the Skies dataset was conceived due to the lack of public datasets focused on erotic literature.
### Source Data
#### Initial Data Collection and Normalization
An initial scrape based on categories was done to collect all the public stories. After that, we used the public JSON API present on the beta version of the site to scrape all the remaining pages and stories.
#### Who are the source language producers?
The respective authors of each chapter/short story.
### Annotations
#### Annotation process
All the data presented was parsed by scripts from data provided via the beta API.
#### Who are the annotators?
No human annotators were present.
### Personal and Sensitive Information
The dataset contains only works of fiction, and we do not believe it contains any PII. However, usernames and author biographies could leak PII.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset is intended to be useful for anyone who wishes to train a model to generate "more entertaining" content. It may also be useful for other languages depending on your language model.
### Discussion of Biases
This dataset is composed of fictional works by various authors. Because of this fact, the contents of this dataset will reflect the biases of those authors. **This dataset consists solely of NSFW material and was not filtered. Beware of stereotypes.**
### Other Known Limitations
Not Applicable
## Additional Information
### Dataset Curators
KaraKaraWitch
### Licensing Information
Apache 2.0, for all parts of which Ronsor Labs or the Ryoko AI Production Committee may be considered authors. All other material is distributed under fair use principles.
### Citation Information
```
@misc{kkr-ftts,
title = {Falling through the Skies: Literature on unseen unique data},
author = {KaraKaraWitch},
year = {2023},
howpublished = {\url{https://huggingface.co/datasets/RyokoAI/FallingThroughTheSkies-592k}},
}
```
### Contributions
[More Information Needed]
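Provider:
RyokoExtra
Summary of original information
Dataset overview
Name: Falling through the Skies
Task categories:
- Text classification
- Text generation
- Text-to-text generation
- Text retrieval
Tags: Not for all audiences
Size: 100K<n<1M
Language: Mainly English
License: MIT
Multilinguality: Monolingual
Dataset description
Overview: Falling through the Skies is an unfiltered literature dataset containing 592k stories.
Data structure:
- Data instances: The data is stored in a 7z file, with a folder containing each chapter/short story. Files are in JSON format.
- Important keys:
  - content: Contains the story/chapter content.
  - meta: Contains several key fields, such as tag, views, words, category, writers_pick, comment_count, desc, author.
- Root-level keys: url, title, content.
Data creation:
- Curation rationale: Created due to the lack of public datasets focused on erotic literature.
- Source data: The initial data was collected through the site's public JSON API.
- Annotations: The data was parsed by scripts provided via the API, with no human annotation.
Considerations for use:
- Social impact: Intended to help train models to generate "more entertaining" content.
- Discussion of biases: The dataset content reflects the biases of its authors; it contains NSFW material and is unfiltered.
Additional information
Dataset curator: KaraKaraWitch
License information: MIT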



