OmniAICreator/Japanese-Roleplay-Dialogues

Name: OmniAICreator/Japanese-Roleplay-Dialogues
Creator: OmniAICreator
Published: 2024-06-08 16:25:27
License: apache-2.0

Hugging Face | Updated 2024-06-08 | Indexed 2024-06-22

Download link:

https://hf-mirror.com/datasets/OmniAICreator/Japanese-Roleplay-Dialogues

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: filtered
    path: Japanese-Roleplay-Dialogues-Filtered.jsonl
  - split: original
    path: Japanese-Roleplay-Dialogues-Original.jsonl
license: apache-2.0
task_categories:
- text-generation
language:
- ja
tags:
- roleplay
size_categories:
- 10K<n<100K
---

# Japanese-Roleplay-Dialogues

This is a dialogue corpus collected from Japanese role-playing forums (commonly known as "なりきりチャット(narikiri chat)"). Each record corresponds to a single thread.

For the `original` version, no filtering has been applied.

For the `filtered` version, the following filtering and cleaning conditions have been applied:

- If the number of unique `poster` values in the `posts` of a record is 1 or less, delete the entire record.
- If the length of `posts` is 10 or less, delete the entire record.
- Normalize poster names by removing all whitespace characters (both full-width and half-width spaces) and anonymizing them.
- Retain only records where the top two posters account for 90% or more of the total posts.
- Remove posts from posters other than the top two posters.
- Among the top two posters, neither should account for more than 60% of their combined posts (i.e., each of the top two posters should account for at least 40% of their combined posts).
- Remove initial consecutive posts from the same poster until a different poster appears.
- Remove final consecutive posts from the same poster until a different poster appears.
- Merge consecutive posts from the same poster into a single post, separated by line breaks.

Not all dialogues are purely role-playing. Some records include initial discussions about the settings, or they may continue from other threads.

This dataset is intended for use in machine learning applications only.

Provider:

OmniAICreator

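Summary of original information

Japanese Roleplay Dialogues dataset

Overview

This dataset is a dialogue corpus collected from Japanese role-playing forums (commonly known as "なりきりチャット(narikiri chat)"). Each record corresponds to a single thread.

Configuration

Default configuration:
- Splits:
  - filtered: path is Japanese-Roleplay-Dialogues-Filtered.jsonl
  - original: path is Japanese-Roleplay-Dialogues-Original.jsonl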
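
Both splits are JSON Lines files, so each line should hold one record (one thread). A minimal sketch of reading such a file with the Python standard library is shown here; the helper name `iter_records` is hypothetical, and the only field names taken from the dataset card are `posts` and `poster`.

```python
import json


def iter_records(path):
    """Yield one parsed record (one thread) per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, which are not valid JSON documents
                yield json.loads(line)
```

Assuming the filtered split has been downloaded to the working directory, `for record in iter_records("Japanese-Roleplay-Dialogues-Filtered.jsonl")` would then iterate over threads, with `record["posts"]` expected to hold the list of posts.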

License

This dataset is released under the apache-2.0 license.

Task categories

Text generation

Language

Japanese

Size category

10K<n<100K

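Data processing

Original version: no filtering has been applied.
Filtered version: the following filtering and cleaning conditions have been applied:
- If the number of unique poster values in the posts of a record is 1 or less, delete the entire record.
- If the length of posts is 10 or less, delete the entire record.
- Normalize poster names by removing all whitespace characters (both full-width and half-width spaces) and anonymizing them.
- Retain only records where the top two posters account for 90% or more of the total posts.
- Remove posts from posters other than the top two posters.
- Among the top two posters, neither should account for more than 60% of their combined posts (i.e., each of the top two posters should account for at least 40% of their combined posts).
- Remove initial consecutive posts from the same poster until a different poster appears.
- Remove final consecutive posts from the same poster until a different poster appears.
- Merge consecutive posts from the same poster into a single post, separated by line breaks.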
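
The filtering conditions listed above can be sketched roughly as follows. This is not the dataset author's code, only one possible reading of the rules, and it rests on several assumptions: the post body field is called `text` (hypothetical; only `posts` and `poster` appear in the card), anonymized names take the form `user_0`, `user_1`, ... (hypothetical), the rules run in the listed order, and a leading or trailing run is dropped only when it contains two or more posts (the card does not say what happens to a run of length one).

```python
import re
from collections import Counter


def normalize_name(name):
    # In str patterns, \s matches Unicode whitespace, including the full-width space U+3000.
    return re.sub(r"\s+", "", name)


def run_length(posters):
    """Length of the run of identical values at the start of `posters`."""
    n = 0
    for p in posters:
        if p != posters[0]:
            break
        n += 1
    return n


def trim_edge_runs(posts):
    # Assumed reading: drop a leading/trailing same-poster run only if it has 2+ posts.
    head = run_length([p["poster"] for p in posts])
    if head >= 2:
        posts = posts[head:]
    tail = run_length([p["poster"] for p in posts][::-1])
    if tail >= 2:
        posts = posts[:len(posts) - tail]
    return posts


def merge_runs(posts):
    """Merge consecutive posts by the same poster, joined with a line break."""
    merged = []
    for post in posts:
        if merged and merged[-1]["poster"] == post["poster"]:
            merged[-1]["text"] += "\n" + post["text"]
        else:
            merged.append(dict(post))
    return merged


def filter_record(posts):
    """Return the cleaned posts, or None if the record should be dropped."""
    if len({p["poster"] for p in posts}) <= 1 or len(posts) <= 10:
        return None
    # Normalize names, then anonymize in order of first appearance (hypothetical labels).
    labels = {}
    cleaned = []
    for p in posts:
        name = normalize_name(p["poster"])
        label = labels.setdefault(name, f"user_{len(labels)}")
        cleaned.append({"poster": label, "text": p["text"]})
    top_two = Counter(p["poster"] for p in cleaned).most_common(2)
    if len(top_two) < 2:  # distinct raw names may collapse into one after normalization
        return None
    (a, n_a), (b, n_b) = top_two
    # Integer arithmetic avoids floating-point edge cases at exactly 90% and 60%.
    if (n_a + n_b) * 10 < len(cleaned) * 9:
        return None
    if max(n_a, n_b) * 10 > (n_a + n_b) * 6:
        return None
    cleaned = [p for p in cleaned if p["poster"] in (a, b)]
    return merge_runs(trim_edge_runs(cleaned))
```

Applied to the `posts` list of a record from the original split, `filter_record` returns either `None` (record dropped) or a list of posts in which adjacent posts always come from different posters.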

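Notes

Not all dialogues are purely role-playing. Some records include initial discussions about the settings, or they may continue from other threads.
This dataset is intended for use in machine learning applications only.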
