Japanese-Roleplay-Dialogues

ModelScope community; updated 2025-08-08; indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/OmniAICreator/Japanese-Roleplay-Dialogues
Resource description:
# Japanese-Roleplay-Dialogues

This is a dialogue corpus collected from Japanese role-playing forum threads, a style commonly known as "なりきりチャット" (narikiri chat). Each record corresponds to a single thread.

For the `original` version, no filtering has been applied.

For the `filtered` version, the following filtering and cleaning conditions have been applied:

- If the `posts` of a record contain one or fewer unique `poster` values, delete the entire record.
- If `posts` contains 10 or fewer entries, delete the entire record.
- Normalize poster names by removing all whitespace characters (both full-width and half-width spaces), and anonymize them.
- Retain only records where the top two posters account for 90% or more of the total posts.
- Remove posts from posters other than the top two posters.
- Among the top two posters, neither should account for more than 60% of their combined posts (i.e., each of the top two posters should account for at least 40% of their combined posts).
- Remove initial consecutive posts from the same poster until a different poster appears.
- Remove final consecutive posts from the same poster until a different poster appears.
- Merge consecutive posts from the same poster into a single post, separated by line breaks.

Not all dialogues are purely role-playing. Some records include initial discussions about the settings, and some continue from other threads.

This dataset is intended for use in machine learning applications only.
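The filtering rules listed above could be sketched roughly as follows. This is not the dataset author's actual pipeline, only an illustration under stated assumptions: each post is assumed to be a dict with a `poster` key and a hypothetical `text` key for the message body, the anonymized labels `user_1` / `user_2` are invented for the example, and the "remove initial/final consecutive posts" rules are read as "drop posts from the edge while the neighbouring post has the same poster", which is only one possible interpretation.

```python
import re
from collections import Counter

# Half-width and full-width whitespace (\s already covers U+3000 for str
# patterns; it is listed explicitly for readability).
WHITESPACE = re.compile(r"[\s\u3000]+")


def filter_record(record):
    """Return a cleaned copy of `record`, or None if the record is dropped.

    Assumed layout (hypothetical): {"posts": [{"poster": str, "text": str}, ...]}
    """
    posts = record["posts"]

    # Drop threads with at most one unique poster or at most 10 posts.
    if len({p["poster"] for p in posts}) <= 1 or len(posts) <= 10:
        return None

    # Normalize poster names by stripping all whitespace.
    posts = [{**p, "poster": WHITESPACE.sub("", p["poster"])} for p in posts]

    # Find the two most active posters.
    top = Counter(p["poster"] for p in posts).most_common(2)
    if len(top) < 2:  # names may collapse into one after normalization
        return None
    (a, n_a), (b, n_b) = top

    # Top two must cover >= 90% of all posts (integer arithmetic avoids
    # floating-point edge cases at the threshold).
    if 10 * (n_a + n_b) < 9 * len(posts):
        return None
    # Neither of the two may exceed 60% of their combined posts.
    if 10 * max(n_a, n_b) > 6 * (n_a + n_b):
        return None

    # Keep only posts from the top two posters.
    posts = [p for p in posts if p["poster"] in (a, b)]

    # Trim repeated same-poster posts at the start and at the end
    # (assumed reading of the rule; see the note above).
    while len(posts) > 1 and posts[0]["poster"] == posts[1]["poster"]:
        posts = posts[1:]
    while len(posts) > 1 and posts[-1]["poster"] == posts[-2]["poster"]:
        posts = posts[:-1]

    # Anonymize names and merge consecutive posts from the same poster.
    aliases = {}
    merged = []
    for p in posts:
        alias = aliases.setdefault(p["poster"], f"user_{len(aliases) + 1}")
        if merged and merged[-1]["poster"] == alias:
            merged[-1]["text"] += "\n" + p["text"]
        else:
            merged.append({"poster": alias, "text": p["text"]})

    return {**record, "posts": merged}
```

Applying `filter_record` to every record of the `original` split and discarding the `None` results would give something resembling the `filtered` split, although the real field names and edge-case handling may differ.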
Provider:
maas
Created:
2025-07-07