OmniAICreator/Japanese-Roleplay-Dialogues
Source: Hugging Face. Updated 2024-06-08. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/OmniAICreator/Japanese-Roleplay-Dialogues
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: filtered
    path: Japanese-Roleplay-Dialogues-Filtered.jsonl
  - split: original
    path: Japanese-Roleplay-Dialogues-Original.jsonl
license: apache-2.0
task_categories:
- text-generation
language:
- ja
tags:
- roleplay
size_categories:
- 10K<n<100K
---
# Japanese-Roleplay-Dialogues
This is a dialogue corpus collected from Japanese role-playing forum threads (a format commonly known as "なりきりチャット", or narikiri chat). Each record corresponds to a single thread.
For the `original` version, no filtering has been applied.
For the `filtered` version, the following filtering and cleaning conditions have been applied:
- If a record's `posts` contain 1 or fewer unique `poster` values, delete the entire record.
- If a record's `posts` list has 10 or fewer entries, delete the entire record.
- Normalize poster names by removing all whitespace characters (both full-width and half-width spaces) and anonymizing them.
- Retain only records where the top two posters account for 90% or more of the total posts.
- Remove posts from posters other than the top two posters.
- Among the top two posters, neither should account for more than 60% of their combined posts (i.e., each of the top two posters should account for at least 40% of their combined posts).
- Remove initial consecutive posts from the same poster until a different poster appears.
- Remove final consecutive posts from the same poster until a different poster appears.
- Merge consecutive posts from the same poster into a single post, separated by line breaks.
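The filtering rules above can be sketched in Python as follows. This is an illustrative reading of the list, not the author's actual script. Several details are assumptions: the `text_key` parameter names a hypothetical field holding the post body, the `poster_0` / `poster_1` labels are one possible anonymization scheme, and the leading and trailing runs are dropped in full (the list does not say whether a single leading or trailing post counts as a run).

```python
from collections import Counter
from itertools import groupby


def clean_thread(posts, text_key="text"):
    """Return the cleaned posts of one thread, or None if the thread is dropped.

    `posts` is a list of dicts with a "poster" key. `text_key` is a
    hypothetical name for the field that holds the post body.
    """
    # Drop threads with at most one unique poster or at most 10 posts.
    if len({p["poster"] for p in posts}) <= 1 or len(posts) <= 10:
        return None

    # Strip all whitespace from names; str.split() also splits on U+3000.
    posts = [{**p, "poster": "".join(p["poster"].split())} for p in posts]

    top = Counter(p["poster"] for p in posts).most_common(2)
    if len(top) < 2:  # names collapsed to one poster after normalization
        return None
    (_, c1), (_, c2) = top

    # Top two posters must cover at least 90% of all posts (integer math).
    if 10 * (c1 + c2) < 9 * len(posts):
        return None
    # Neither of the two may exceed 60% of their combined posts.
    if 10 * c1 > 6 * (c1 + c2):
        return None

    names = {name for name, _ in top}
    kept = [p for p in posts if p["poster"] in names]

    # Assumption: drop the whole leading run, then the whole trailing run.
    first = kept[0]["poster"]
    while kept and kept[0]["poster"] == first:
        kept.pop(0)
    if not kept:
        return None
    last = kept[-1]["poster"]
    while kept and kept[-1]["poster"] == last:
        kept.pop()
    if not kept:
        return None

    # Merge consecutive posts by the same poster and anonymize the names.
    labels = {}
    merged = []
    for name, run in groupby(kept, key=lambda p: p["poster"]):
        if name not in labels:
            labels[name] = f"poster_{len(labels)}"  # hypothetical label scheme
        body = "\n".join(p[text_key] for p in run)
        merged.append({"poster": labels[name], text_key: body})
    return merged
```

The ratio checks use integer arithmetic so that thresholds such as exactly 90% are not affected by floating-point rounding.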
Not all dialogues are purely role-playing. Some records include initial discussions about the settings, or they may continue from other threads.
This dataset is intended for use in machine learning applications only.
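Both splits are stored as JSON Lines files, named in the card's config. A minimal sketch for reading one of them with the Python standard library is shown here, assuming the file has already been downloaded locally and holds one JSON object (one thread) per line. The helper name `iter_records` is illustrative, not part of the dataset.

```python
import json


def iter_records(path):
    """Yield one thread record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Example (assumes the filtered split sits in the working directory):
# for record in iter_records("Japanese-Roleplay-Dialogues-Filtered.jsonl"):
#     print(len(record["posts"]))
```

Reading line by line keeps memory use flat, which matters little at this dataset's size but costs nothing. The Hugging Face `datasets` library should also be able to load either split by name through `load_dataset`, given the `configs` entry in the card's metadata.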
Provider:
OmniAICreator
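Summary of original information
Japanese Role-Play Dialogue Dataset
Overview
This dataset is a dialogue corpus collected from Japanese role-playing forum threads (a format commonly known as "なりきりチャット", or narikiri chat). Each record corresponds to a single thread.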
Configuration
- Default configuration:
  - Splits:
    - filtered: path `Japanese-Roleplay-Dialogues-Filtered.jsonl`
    - original: path `Japanese-Roleplay-Dialogues-Original.jsonl`
License
- This dataset is released under the apache-2.0 license.
Task categories
- Text generation
Language
- Japanese
Tags
- Role-play
Size category
- 10K<n<100K
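Data processing
- Original version: no filtering applied.
- Filtered version: the following filtering and cleaning conditions were applied:
  - If a record's `posts` contain 1 or fewer unique `poster` values, the entire record is deleted.
  - If a record's `posts` list has 10 or fewer entries, the entire record is deleted.
  - Poster names are normalized by removing all whitespace characters (both full-width and half-width spaces) and anonymized.
  - Only records where the top two posters account for 90% or more of the total posts are retained.
  - Posts from posters other than the top two are removed.
  - Among the top two posters, neither may account for more than 60% of their combined posts (i.e., each must account for at least 40% of their combined posts).
  - Initial consecutive posts from the same poster are removed until a different poster appears.
  - Final consecutive posts from the same poster are removed until a different poster appears.
  - Consecutive posts from the same poster are merged into a single post, separated by line breaks.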
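Notes
- Not all dialogues are purely role-playing. Some records include initial discussions about the settings, or they may continue from other threads.
- This dataset is intended for use in machine learning applications only.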



