dinalt/roleplay_build
收藏Hugging Face2024-04-01 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/dinalt/roleplay_build
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- text-generation
pretty_name: Roleplay Build
dataset_info:
features:
- name: pairing_reason
dtype: string
- name: plist
dtype: string
- name: director_log
list:
- name: content
dtype: string
- name: index
dtype: int64
- name: name
dtype: string
- name: role
dtype: string
- name: tokens
dtype: int64
- name: scenario
dtype: string
- name: proxy
struct:
- name: description
dtype: string
- name: example_dialog
sequence: string
- name: greeting
dtype: string
- name: name
dtype: string
- name: plist
dtype: string
- name: preset
dtype: string
- name: summary
dtype: string
- name: system
struct:
- name: content
dtype: string
- name: name
dtype: string
- name: role
dtype: string
- name: tokens
dtype: int64
- name: example_dialog
sequence: string
- name: conversation
list:
- name: content
dtype: string
- name: name
dtype: string
- name: role
dtype: string
- name: tokens
dtype: int64
- name: char_name
dtype: string
- name: description
dtype: string
- name: summary
dtype: string
- name: preset
dtype: string
- name: greeting
dtype: string
splits:
- name: train
num_bytes: 117706519.38715278
num_examples: 2770
download_size: 63025218
dataset_size: 117706519.38715278
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# Dataset Card for Dataset Name
RPBuild is a prototype synthetic roleplay dataset built with the [RPBuild library](https://github.com/jdinalt/rpbuild).
## Dataset Details
### Dataset Description
RPBuild is a synthetic dataset intended to help improve language models roleplay abilities. This is an early prototype for evaluation.
I would love to receive feedback!
You can find example code for converting the dataset to various formats [here](https://github.com/jdinalt/rpbuild/blob/main/notebooks/formatting_examples.ipynb).
### Dataset Sources
- **Character Seed Data:** https://huggingface.co/datasets/hieunguyenminh/roleplay
- **Meta Data:** "[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)"
- **Writer, Actors, and Director:** "[fhai50032/RolePlayLake-7B](https://huggingface.co/fhai50032/RolePlayLake-7B)"
## Dataset Creation
The generation was achived by using "[hieunguyenminh/roleplay](https://huggingface.co/datasets/hieunguyenminh/roleplay)" as a seed dataset.
Mistral 7B was used to expand the characer names and summaries from the original seed dataset to a full set of character data,
based upon the format described in the [Silly Tavern documentation](https://docs.sillytavern.app/usage/core-concepts/characterdesign/).
For each character record in the metadata, a second character was selected to act as a "proxy" for a real user.
A plot outline was generated from the two selected characters, an "actor" agent was created for each character and
the two actors generated the dialog under the guidance of a "director" agent; without guidance, the interaction tends to stagnate.
Dialog was generated until a context of at least 4000 tokens was generated.
In theory, the "directors" instructions can be used to train a model to follow the user's instructions for roleplay. They could also be formatted as to appear as internal thoughts of the character.
### Curation Rationale
The dataset was created for an experiment I have in mind which requires that the model is proficient at roleplay. I was unable to locate a dataset meeting
my needs, so I created the [rpbuild library](https://github.com/jdinalt/rpbuild) to faciliate this. This dataset is an early proof-of-concept for the generation
methodology.
### Revision Notes
After training a small model on the original revision of the dataset, I discovered an unusually high number of generations including the name "Alice,"
who was not mentioned in the preceding dialog or director's instructions. This appears to have been an artifact of using the name "Alice" in the dialog
generation instructions as an example, thus "Alice" was present in about half of the original dataset rows.
As a mitigation, I have filtered all rows containing "Alice" in the updated dataset.
While searching for issues, I also found frequent generations resembling the PList format and instances of character impersonation in both the primary dialog and the
director's instructions. These have been removed by regex substitution, removing the text starting with the PList/impersonation and any followiing text.
I am working on improving the quality control of the generations in the library, including using the model for detecting issues and triggering "retakes."
Once this is working, I will rebuilt the full dataset and update it.
Let me know if you find any other issues.
提供机构:
dinalt
原始信息汇总
数据集概述
数据集名称
RPBuild
数据集描述
RPBuild是一个旨在帮助改进语言模型角色扮演能力的合成数据集。这是一个用于评估的早期原型。
数据集特征
- pairing_reason: 字符串
- plist: 字符串
- director_log: 列表
- content: 字符串
- index: 整数
- name: 字符串
- role: 字符串
- tokens: 整数
- scenario: 字符串
- proxy: 结构体
- description: 字符串
- example_dialog: 序列,字符串
- greeting: 字符串
- name: 字符串
- plist: 字符串
- preset: 字符串
- summary: 字符串
- system: 结构体
- content: 字符串
- name: 字符串
- role: 字符串
- tokens: 整数
- example_dialog: 序列,字符串
- conversation: 列表
- content: 字符串
- name: 字符串
- role: 字符串
- tokens: 整数
- char_name: 字符串
- description: 字符串
- summary: 字符串
- preset: 字符串
- greeting: 字符串
数据集划分
- train: 2770个样本,大小为117706519.38715278字节
数据集大小
- 下载大小: 63025218字节
- 数据集大小: 117706519.38715278字节
许可证
CC-BY-4.0



