
RyokoExtra/FallingThroughTheSkies-592k

Source: Hugging Face. Updated 2023-08-09. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RyokoExtra/FallingThroughTheSkies-592k
Resource description:
---
task_categories:
- text-classification
- text-generation
- text2text-generation
- text-retrieval
tags:
- not-for-all-audiences
pretty_name: Falling through the Skies
size_categories:
- 100K<n<1M
language:
- en
license:
- mit
multilinguality:
- monolingual
---

# Dataset Card for Falling through the Skies

## Dataset Description

- **Homepage:** Nowhere else except here.
- **Repository:** Here!
- **Paper:** Nil
- **Leaderboard:** Nil
- **Point of Contact:** KaraKaraWitch

### Dataset Summary

*Falling through the Skies* is an unfiltered dump of **592k stories** from Literotica.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset mainly targets English. However, some entries may be in other languages.

## Dataset Structure

### Data Instances

The 7z file contains a folder holding each chapter/short story. Each file is formatted as JSON, and the meaning of most values should be intuitive.

The important keys are:

- `content`: Contains the story/chapter content. If the story is paged, the pages have been collapsed and separated with 2 newline characters: `<Page 1>\n\n<Page 2>`
- `meta`: Consider using this field instead of the related field in the root. For example, `"view_count"` > `meta > view_count`.

### Data Fields

There are too many fields to list, but these are a few key fields that you might be interested in (they are located in the `meta` key):

- `tag`: Author tags for the work.
- `views`: Number of views at the time of the scrape.
- `words`: Number of words in the work.
- `category`: Category in which the story/chapter is placed.
- `writers_pick`: Set to `true` if the work has been selected.
- `comment_count`: The number of comments for this story/chapter. Note, however, that the content of comments is not captured.
- `desc`: The short blurb that introduces the story/chapter.
- `author`: Author of the story/chapter.

The following keys are located in the root:

- `url`: The URL slug. Example: `https://<Site Redacted>/s/<url>`.
- `title`: The title of the story/chapter.
- `content`: The contents of the story/chapter.

### Data Splits

There are no data splits, as the total size can be stored within a filesystem.

## Dataset Creation

### Curation Rationale

The Falling through the Skies dataset was conceived due to the lack of public datasets focused on erotic literature.

### Source Data

#### Initial Data Collection and Normalization

An initial scrape based on categories was done to collect all the public stories. After that, we used the public JSON API present on the beta version of the site to further scrape all the pages and stories.

#### Who are the source language producers?

The respective authors of each chapter/short story.

### Annotations

#### Annotation process

All the data presented was parsed by scripts provided via the beta API.

#### Who are the annotators?

No human annotators were present.

### Personal and Sensitive Information

The dataset contains only works of fiction, and we do not believe it contains any PII. However, usernames and author biographies could leak PII.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is intended to be useful for anyone who wishes to train a model to generate "more entertaining" content. It may also be useful for other languages, depending on your language model.

### Discussion of Biases

This dataset is composed of fictional works by various authors. Because of this, the contents of this dataset will reflect the biases of those authors. **This dataset contains only NSFW material and was not filtered. Beware of stereotypes.**

### Other Known Limitations

Not Applicable

## Additional Information

### Dataset Curators

KaraKaraWitch

### Licensing Information

[More Information Needed]

### Citation Information

Apache 2.0, for all parts of which Ronsor Labs or the Ryoko AI Production Committee may be considered authors. All other material is distributed under fair use principles.

### Contributions

```
@misc{kkr-ftts,
  title = {Falling through the Skies: Literature on unseen unique data},
  author = {KaraKaraWitch},
  year = {2023},
  howpublished = {\url{https://huggingface.co/datasets/RyokoAI/FallingThroughTheSkies-592k}},
}
```
Provider:
RyokoExtra
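Summary of the original information

Dataset overview

Name: Falling through the Skies

Task categories:

  • Text classification
  • Text generation
  • Text-to-text generation
  • Text retrieval

Tags: not-for-all-audiences

Size: 100K<n<1M

Language: mainly English

License: MIT

Multilinguality: monolingual

Dataset description

Overview: Falling through the Skies is an unfiltered literature dataset containing 592k stories.

Data structure:

  • Data instances: The data is stored in a 7z file that contains a folder with one file per chapter/short story. The files are formatted as JSON.
  • Important keys:
    • content: Contains the story/chapter content.
    • meta: Contains several key fields, such as tag, views, words, category, writers_pick, comment_count, desc, author
    • Root-level keys: url, title, content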
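
The key layout described above can be read with a short Python sketch. It assumes the 7z archive has already been extracted and that each story file ends in `.json`; the file extension, the folder path, and the helper names `load_story` and `iter_stories` are illustrative assumptions, not part of the dataset card.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def load_story(path: Path) -> dict[str, Any]:
    """Read one story/chapter JSON file and flatten the keys named in the card."""
    record = json.loads(path.read_text(encoding="utf-8"))
    # The card suggests preferring values under `meta` over related root fields.
    meta = record.get("meta") or {}
    return {
        # Root-level keys
        "url": record.get("url"),
        "title": record.get("title"),
        "content": record.get("content", ""),
        # Keys the card places under `meta`; missing keys become None
        "author": meta.get("author"),
        "category": meta.get("category"),
        "tags": meta.get("tag"),
        "views": meta.get("views"),
        "words": meta.get("words"),
        "writers_pick": meta.get("writers_pick"),
    }


def iter_stories(folder: Path) -> Iterator[dict[str, Any]]:
    """Yield flattened records for every *.json file (assumed extension)."""
    for path in sorted(folder.glob("*.json")):
        yield load_story(path)
```

For example, `next(iter_stories(Path("extracted_folder")))` would return the first record, where `extracted_folder` is a placeholder for wherever the archive was unpacked. Using `.get` throughout keeps the sketch tolerant of files that lack some of the fields.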

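Data creation:

  • Curation rationale: Created due to the lack of public datasets focused on erotic literature.
  • Source data: An initial category-based scrape, followed by collection through the public JSON API on the beta version of the site.
  • Annotations: The data was parsed by scripts using the beta API; there were no human annotators.

Considerations for use:

  • Social impact: Intended to help train models to generate "more entertaining" content.
  • Discussion of biases: The dataset contents reflect the biases of the authors; the dataset contains NSFW material and was not filtered.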

Additional information

Dataset curator: KaraKaraWitch

License information: MIT

Citation information:

@misc{kkr-ftts, title = {Falling through the Skies: Literature on unseen unique data}, author = {KaraKaraWitch}, year = {2023}, howpublished = {\url{https://huggingface.co/datasets/RyokoAI/FallingThroughTheSkies-592k}}, }
