Japanese-Roleplay-Dialogues
Source: ModelScope community. Updated: 2025-08-08. Indexed: 2025-07-12.
Download link:
https://modelscope.cn/datasets/OmniAICreator/Japanese-Roleplay-Dialogues
Description:
# Japanese-Roleplay-Dialogues
This is a dialogue corpus collected from a Japanese role-playing forum (commonly known as "なりきりチャット" (narikiri chat)). Each record corresponds to a single thread.
For the `original` version, no filtering has been applied.
For the `filtered` version, the following filtering and cleaning conditions have been applied:
- If the `posts` of a record contain one or fewer unique `poster` values, delete the entire record.
- If `posts` contains 10 or fewer entries, delete the entire record.
- Normalize poster names by removing all whitespace characters (both full-width and half-width spaces), then anonymize them.
- Retain only records where the top two posters account for 90% or more of the total posts.
- Remove posts from posters other than the top two posters.
- Among the top two posters, neither should account for more than 60% of their combined posts (i.e., each of the top two posters should account for at least 40% of their combined posts).
- Remove initial consecutive posts from the same poster until a different poster appears.
- Remove final consecutive posts from the same poster until a different poster appears.
- Merge consecutive posts from the same poster into a single post, separated by line breaks.
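
The rules above can be sketched in Python as follows. This is a minimal illustration, not the dataset author's actual pipeline. It assumes each post is a dict with a `poster` key (named in this card) and a `text` key (a hypothetical name for the post body), applies the rules in the order listed, and uses the placeholder labels `A` and `B` for anonymization. It also reads the two trimming rules as "drop a leading or trailing run of two or more posts by the same poster", which is one possible interpretation.

```python
import re
from collections import Counter

# Matches half-width and full-width whitespace (U+3000 is the full-width space).
_WHITESPACE = re.compile(r"[\s\u3000]+")


def _drop_leading_run(posts):
    """Drop a leading run of 2+ posts by one poster; keep a single opener.

    Assumption: this is only one reading of the trimming rule in the card.
    """
    n = 1
    while n < len(posts) and posts[n]["poster"] == posts[0]["poster"]:
        n += 1
    return posts[n:] if n >= 2 else posts


def filter_record(posts):
    """Return the cleaned list of posts, or None if the record is dropped.

    `text` is a hypothetical field name; `A`/`B` are placeholder aliases.
    """
    # Rules 1-2: need more than one unique poster and more than 10 posts.
    if len({p["poster"] for p in posts}) <= 1 or len(posts) <= 10:
        return None

    # Rule 3 (first half): strip all whitespace from poster names.
    posts = [{**p, "poster": _WHITESPACE.sub("", p["poster"])} for p in posts]

    top = Counter(p["poster"] for p in posts).most_common(2)
    if len(top) < 2:
        return None
    combined = sum(count for _, count in top)

    # Rule 4: top two posters must cover at least 90% of all posts.
    if combined * 10 < len(posts) * 9:
        return None
    # Rule 6: neither top poster may exceed 60% of their combined posts.
    if max(count for _, count in top) * 5 > combined * 3:
        return None

    # Rule 5 plus rule 3 (second half): keep the top two and anonymize them.
    alias = {top[0][0]: "A", top[1][0]: "B"}
    kept = [
        {"poster": alias[p["poster"]], "text": p["text"]}
        for p in posts
        if p["poster"] in alias
    ]

    # Rules 7-8: trim same-poster runs at the start and at the end.
    kept = _drop_leading_run(kept)
    kept = _drop_leading_run(kept[::-1])[::-1]

    # Rule 9: merge consecutive posts by the same poster with line breaks.
    merged = []
    for p in kept:
        if merged and merged[-1]["poster"] == p["poster"]:
            merged[-1]["text"] += "\n" + p["text"]
        else:
            merged.append(dict(p))
    return merged
```

The percentage checks use integer arithmetic (`combined * 10 < len(posts) * 9`) rather than floating-point ratios so that threads sitting exactly on the 90% or 60% boundary are classified deterministically.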
Not all dialogues are purely role-playing. Some records include initial discussions about the settings, or they may continue from other threads.
This dataset is intended for use in machine learning applications only.
Provider:
maas
Created:
2025-07-07



