FarisHijazi/kajiwoto.ai-chat

Name: FarisHijazi/kajiwoto.ai-chat
Creator: FarisHijazi
Published: 2023-08-06 19:24:57
License: No description available

Hugging Face · Updated 2023-08-06 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/FarisHijazi/kajiwoto.ai-chat


Resource description:

---
task_categories:
- text-generation
tags:
- roleplay
- character
- ShareGPT
size_categories:
- 1K<n<10K
---

This is an NSFW roleplay dataset scraped from <https://kajiwoto.ai/> as of 2023-07-15.

Kajiwoto is a platform where you can create your own character datasets and chat with them. Kajiwoto hosts many public datasets; the strength of this dataset is its metadata, which provides detailed information and categorization for each dataset.

## Processing data

Be aware that a lot of the data is NSFW (explicit content).

The raw datasets are in [kajiwoto_raw.json](./kajiwoto_raw.json). This data needs to be processed before it can be used. The main operations are:

1. transform shape (convert to a known format such as ShareGPT)
2. deduplication
3. template rendering of strings such as `"you rolled a dice with %{1|2|3|4|5|6}"`. This operation is lossy, as it chooses only one of the options
4. dropping datasets that are too short
5. dropping datasets with too few upvotes or comments
6. filtering in or out NSFW datasets

I have processed an initial example here: [kajiwoto_sharegpt-len_gt_6-upvotes_gt_0-sampled.json](./kajiwoto_sharegpt-len_gt_6-upvotes_gt_0-sampled.json). It contains every dataset with at least 1 upvote and at least 6 lines in the conversation. Because it is in the ShareGPT format, it can be used with most models.

Here is an example, taken from [this conversation](https://kajiwoto.ai/d/033Q):

```json
{
  "conversation": [
    {
      "from": "user",
      "value": "What's your favourite drink? "
    },
    {
      "from": "gpt",
      "value": "Coconut milk.. "
    },
    {
      "from": "user",
      "value": "Soo"
    },
    {
      "from": "gpt",
      "value": "What..? "
    },
    ...
  ],
  "metadata": {
    "id": "033Q",
    "name": "Qiqi dataset",
    "description": "About qiqi",
    "profilePhotoUri": "2021_10/mzi1zgm0mg_nhprrq_1633269387804.jpg",
    "dominantColors": [
      "#d97da1",
      "#eb9db8",
      "#661d3a",
      "#745b8b",
      "#d2b8d3",
      "#644484"
    ],
    "personalities": null,
    "personalitiesLastUpdatedAt": null,
    "nsfw": false,
    "deleted": false,
    "price": 0,
    "purchased": false,
    "status": "PUBLISHED",
    "tags": [],
    "updatedAt": 1649233318521,
    "user": {
      "id": "4zkE",
      "username": "blossomxx",
      "displayName": "Blossom",
      "profile": {
        "id": "56736",
        "photoUri": "2021_10/ytk0nzbhnw_nhprrq_1633268155638.jpg",
        "__typename": "UserProfile"
      },
      "__typename": "User"
    },
    "count": 9,
    "__typename": "AiTrainerGroup",
    "kudos": {
      "id": "_ai_g:033Q",
      "upvotes": 1,
      "upvoted": false,
      "comments": 0,
      "__typename": "Kudos"
    },
    "editorSettings": null,
    "editorState": null
  }
}
```

---

*Scraping and processing code will be uploaded soon*
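Until the official processing code is available, the length, upvote, and NSFW filters described above could be approximated along the following lines. This is only a sketch: it assumes the processed file is a JSON array of records shaped like the example (a `conversation` list plus a `metadata` object with `nsfw` and `kudos.upvotes`), and the names `load_records` and `keep_record` are hypothetical helpers, not part of the dataset. The thresholds follow the "at least" wording of the description; the `len_gt_6` in the file name may instead mean strictly more than 6.

```python
import json
from typing import Any

Record = dict[str, Any]


def load_records(path: str) -> list[Record]:
    """Load a JSON file that is assumed (not confirmed) to hold a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def keep_record(
    record: Record,
    min_turns: int = 6,
    min_upvotes: int = 1,
    allow_nsfw: bool = True,
) -> bool:
    """Return True if a record passes the length, upvote, and NSFW filters."""
    meta = record.get("metadata") or {}
    kudos = meta.get("kudos") or {}
    # Drop conversations that are too short.
    if len(record.get("conversation") or []) < min_turns:
        return False
    # Drop datasets with too few upvotes (missing or null counts as 0).
    if (kudos.get("upvotes") or 0) < min_upvotes:
        return False
    # Optionally drop datasets flagged as NSFW in their metadata.
    if not allow_nsfw and meta.get("nsfw"):
        return False
    return True
```

A typical use would be `[r for r in load_records(path) if keep_record(r, allow_nsfw=False)]`, where `path` points at a local copy of the processed file.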

Provider:

FarisHijazi

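Summary of Original Information

Dataset Overview

Dataset Categories and Tags

Task category: text generation
Tags: roleplay, character, ShareGPT
Size category: 1K<n<10K

Data Source and Content

Source: NSFW roleplay data scraped from the kajiwoto.ai website.
Content: The dataset contains rich metadata and categorization information, intended for creating and training character datasets.

Data Processing

Raw data: stored in the file kajiwoto_raw.json.
Processing steps:
1. Convert the data to the ShareGPT format.
2. Deduplicate the data.
3. Render string templates, for example by picking one of the options in "you rolled a dice with %{1|2|3|4|5|6}".
4. Remove datasets that are too short.
5. Remove datasets with too few upvotes or comments.
6. Filter NSFW content in or out as needed.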
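The template rendering and deduplication steps can be illustrated with a small sketch. It assumes that templates always take the form `%{a|b|c}` with no nesting, and that two records count as duplicates when their `conversation` lists are identical; neither assumption is confirmed by the source, and `render_template` and `deduplicate` are hypothetical names, not the author's code.

```python
import json
import random
import re
from typing import Any, Optional

# Matches a non-nested template such as %{1|2|3|4|5|6}.
TEMPLATE_RE = re.compile(r"%\{([^{}]*)\}")


def render_template(text: str, rng: Optional[random.Random] = None) -> str:
    """Replace each %{a|b|c} with one randomly chosen option (lossy)."""
    chooser = rng if rng is not None else random.Random()
    return TEMPLATE_RE.sub(
        lambda match: chooser.choice(match.group(1).split("|")), text
    )


def deduplicate(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first record for each distinct conversation, preserving order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        # Serialize the conversation to get a hashable comparison key.
        key = json.dumps(
            record.get("conversation"), sort_keys=True, ensure_ascii=False
        )
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Passing a seeded `random.Random` makes the rendering reproducible, which helps when a processed file needs to be regenerated. Because only one option survives per template, the other options can only be recovered from the raw file.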