FarisHijazi/kajiwoto.ai-chat

Hugging Face | Updated 2023-08-06 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/FarisHijazi/kajiwoto.ai-chat
Resource description:
---
task_categories:
- text-generation
tags:
- roleplay
- character
- ShareGPT
size_categories:
- 1K<n<10K
---

This is an NSFW roleplay dataset scraped from <https://kajiwoto.ai/> as of 2023-07-15. Kajiwoto is a platform where you can create your own character datasets and chat with them. There are many public datasets on Kajiwoto; the strength of this dataset is its metadata, which gives a lot of information and categorization for each dataset.

## Processing data

Be aware that a lot of the data is NSFW (explicit content).

The raw datasets are in [kajiwoto_raw.json](./kajiwoto_raw.json). This data needs to be processed before it can be used. The main operations are:

1. transform shape (convert to a known format such as ShareGPT)
2. deduplication
3. template rendering of strings such as `"you rolled a dice with %{1|2|3|4|5|6}"`. This operation is lossy, as it will choose only one of the options
4. dropping datasets that are too short
5. dropping datasets with too few upvotes or comments
6. filtering in or out NSFW datasets

I have processed an initial example here: [kajiwoto_sharegpt-len_gt_6-upvotes_gt_0-sampled.json](./kajiwoto_sharegpt-len_gt_6-upvotes_gt_0-sampled.json). It contains every dataset with at least 1 upvote and at least 6 lines in the conversation. You can use it with most models, as it is in the ShareGPT format.

Here's an example, [this conversation](https://kajiwoto.ai/d/033Q):

```json
{
  "conversation": [
    { "from": "user", "value": "What's your favourite drink? " },
    { "from": "gpt", "value": "Coconut milk.. " },
    { "from": "user", "value": "Soo" },
    { "from": "gpt", "value": "What..? " },
    ...
  ],
  "metadata": {
    "id": "033Q",
    "name": "Qiqi dataset",
    "description": "About qiqi",
    "profilePhotoUri": "2021_10/mzi1zgm0mg_nhprrq_1633269387804.jpg",
    "dominantColors": [ "#d97da1", "#eb9db8", "#661d3a", "#745b8b", "#d2b8d3", "#644484" ],
    "personalities": null,
    "personalitiesLastUpdatedAt": null,
    "nsfw": false,
    "deleted": false,
    "price": 0,
    "purchased": false,
    "status": "PUBLISHED",
    "tags": [],
    "updatedAt": 1649233318521,
    "user": {
      "id": "4zkE",
      "username": "blossomxx",
      "displayName": "Blossom",
      "profile": {
        "id": "56736",
        "photoUri": "2021_10/ytk0nzbhnw_nhprrq_1633268155638.jpg",
        "__typename": "UserProfile"
      },
      "__typename": "User"
    },
    "count": 9,
    "__typename": "AiTrainerGroup",
    "kudos": {
      "id": "_ai_g:033Q",
      "upvotes": 1,
      "upvoted": false,
      "comments": 0,
      "__typename": "Kudos"
    },
    "editorSettings": null,
    "editorState": null
  }
}
```

---

*Scraping and processing code will be uploaded soon*
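Until the author's processing code is published, the template-rendering step (operation 3) can be illustrated with a minimal sketch. It assumes the placeholder syntax is exactly `%{option|option|...}`, which is inferred only from the single example string in this description; the function name `render_template` and the regular expression are hypothetical and not taken from the author's code.

```python
import random
import re

# Assumed placeholder syntax, inferred only from the one example string:
# "%{" followed by options separated by "|" and a closing "}".
TEMPLATE_PATTERN = re.compile(r"%\{([^{}]*)\}")


def render_template(text, rng=None):
    """Replace each %{a|b|c} placeholder with one randomly chosen option."""
    rng = rng or random.Random()

    def pick(match):
        options = match.group(1).split("|")
        return rng.choice(options)

    return TEMPLATE_PATTERN.sub(pick, text)


# A seeded generator makes the choice repeatable between runs.
rendered = render_template(
    "you rolled a dice with %{1|2|3|4|5|6}", rng=random.Random(0)
)
print(rendered)
```

Because only one option survives per placeholder, the rendered string cannot be mapped back to the template, which is the lossiness the description mentions.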
Provided by:
FarisHijazi
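Summary of original information

Dataset overview

Dataset categories and tags

  • Task category: text generation
  • Tags: roleplay, character, ShareGPT
  • Size category: 1K<n<10K

Data source and content

  • Source: NSFW roleplay data scraped from the kajiwoto.ai website.
  • Content: the dataset includes rich metadata and categorization for creating and training character datasets.

Data processing

  • Raw data: stored in the file kajiwoto_raw.json.
  • Processing steps:
    1. Convert the data to the ShareGPT format.
    2. Deduplicate the data.
    3. Render string templates, for example by picking one of the options in "you rolled a dice with %{1|2|3|4|5|6}".
    4. Remove datasets that are too short.
    5. Remove datasets with too few upvotes or comments.
    6. Filter NSFW content in or out as needed.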
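
The deduplication and filtering steps above can be sketched as follows. This is a sketch under stated assumptions, not the author's pipeline: it assumes records already have the `conversation` / `metadata` shape shown in the example record, it deduplicates on exact conversation content (the source does not say how duplicates are detected), and the names `keep_record`, `deduplicate`, and `make_record`, along with the toy data, are hypothetical.

```python
import json


def keep_record(record, min_turns=6, min_upvotes=1, allow_nsfw=True):
    """Return True if a ShareGPT-style record passes the thresholds."""
    meta = record.get("metadata", {})
    kudos = meta.get("kudos") or {}
    if len(record.get("conversation", [])) < min_turns:
        return False
    if kudos.get("upvotes", 0) < min_upvotes:
        return False
    if meta.get("nsfw") and not allow_nsfw:
        return False
    return True


def deduplicate(records):
    """Drop records whose conversation repeats an earlier one exactly."""
    seen = set()
    unique = []
    for record in records:
        key = json.dumps(record.get("conversation", []), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def make_record(n_turns, upvotes, nsfw=False):
    """Build a toy record shaped like the example (hypothetical data)."""
    turns = [
        {"from": "user" if i % 2 == 0 else "gpt", "value": f"turn {i}"}
        for i in range(n_turns)
    ]
    return {
        "conversation": turns,
        "metadata": {"nsfw": nsfw, "kudos": {"upvotes": upvotes, "comments": 0}},
    }


records = [
    make_record(8, 1),
    make_record(8, 1),              # exact duplicate of the first record
    make_record(2, 5),              # too short
    make_record(7, 0),              # no upvotes
    make_record(10, 3, nsfw=True),  # kept only when NSFW is allowed
]
unique = deduplicate(records)
sfw_only = [r for r in unique if keep_record(r, allow_nsfw=False)]
with_nsfw = [r for r in unique if keep_record(r, allow_nsfw=True)]
print(len(unique), len(sfw_only), len(with_nsfw))  # prints: 4 1 2
```

The thresholds are parameters because the description gives only one concrete setting (at least 1 upvote, at least 6 lines); other cut-offs would be passed in the same way.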

Processed sample data

  • File: kajiwoto_sharegpt-len_gt_6-upvotes_gt_0-sampled.json
  • Criteria: at least 1 upvote and at least 6 lines in the conversation.
  • Format: ShareGPT format, suitable for most models.
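
Many chat-model training tools expect `role` / `content` messages rather than ShareGPT's `from` / `value` turns, so a conversion step is often needed before use. The sketch below assumes the record shape from the example record (note its top-level key is `conversation`) and assumes that `gpt` should map to `assistant`, which is a common convention rather than something stated in the source; `ROLE_MAP` and `to_messages` are hypothetical names.

```python
# Assumed role mapping; "user" and "gpt" are the only speakers in the example.
ROLE_MAP = {"user": "user", "gpt": "assistant"}


def to_messages(record):
    """Convert one ShareGPT-style record into a list of role/content dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversation"]
    ]


# The first four turns of the example conversation from the dataset card.
example = {
    "conversation": [
        {"from": "user", "value": "What's your favourite drink? "},
        {"from": "gpt", "value": "Coconut milk.. "},
        {"from": "user", "value": "Soo"},
        {"from": "gpt", "value": "What..? "},
    ]
}
messages = to_messages(example)
for message in messages:
    print(message["role"], "->", message["content"].strip())
```

An unknown speaker label raises a `KeyError` here, which surfaces unexpected data instead of silently mislabeling it.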

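Example conversation

  • Conversation example: exchanges between the user and GPT, such as questions and answers.
  • Metadata: includes the dataset ID, name, description, user information, and other details.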
Collected summary
Dataset introduction
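Background and challenges
Background overview
This dataset consists of roleplay chat data scraped from the Kajiwoto.ai platform. It includes rich metadata and NSFW content and is suited to text-generation tasks. The data has been converted to the ShareGPT format, deduplicated, and filtered, which makes it suitable for training dialogue models.
The above content was collected and summarized by 遇见数据集.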