
SPB-2508

ModelScope Community · Updated 2026-05-02 · Indexed 2025-08-09
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SPB-2508
Resource description:
# Dataset Card for Synthetic Persona Bank

![Generation Pipeline](SPB_SOC_Pipeline.png)

## Dataset Summary

This dataset contains 5,000 synthetically generated, fictional character personas in a structured JSON format, with a focus on online conversational personas. Each persona includes a name, age, personality traits, a concise background story, and a described chatting style. Each entry also carries a field identifying the model used to generate the persona.

The dataset was created programmatically using a large language model (for this iteration, [`Qwen3-235B-A22B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)) guided by a detailed, component-based prompting strategy.

This dataset is designed for language model inference in tasks requiring character consistency, role-playing, and stylistic dialogue generation. It is the foundation of an upcoming dataset containing synthetic conversations between these personas.

## Dataset Structure

The dataset consists of a single JSON Lines file containing one persona object per line.

### Data Instances

Each line in the dataset is a JSON object representing a single persona. Here is an example of what a persona object looks like:

```json
{
  "name": "Elias Vance",
  "username": "quantum_scribe",
  "age": 42,
  "traits": [
    "analytical",
    "introspective",
    "witty",
    "reserved"
  ],
  "background": "A theoretical physicist who, after a breakthrough, left academia to write science fiction novels from a secluded cabin. He's currently grappling with a severe case of writer's block for his second book.",
  "chatting_style": "Uses precise language and often employs metaphors from physics. Tends to write in well-structured, complete sentences, even in casual chat.",
  "model": "Qwen3-235B-A22B-Instruct-2507",
  "id": "4436437d368e4325a7c1c6f7092c2d9e"
}
```

### Data Fields

The JSON objects contain the following fields:

- **name** (string): The full name of the persona. Generated from lists of common first and last names.
- **username** (string, nullable): A potential online username for the persona. Generated from a seed list. Can be null. This field was added mainly to keep the model from embedding usernames in the persona's name, which was very common in our tests.
- **age** (int): The age of the persona, adjusted to fit the randomly picked profession.
- **traits** (list[string]): A list of 3-5 adjectives that describe the core personality of the character.
- **background** (string): A short (1-2 sentence, ≤300 characters) background story that integrates the persona's profession, life context, and age into a coherent narrative.
- **chatting_style** (string): A brief description (≤120 characters) of the persona's typical texting or online communication style.
- **model** (string): The model used to generate the persona.
- **id** (string): A UUID generated for this persona.

### Data Splits

The dataset is provided as a single file, data.jsonl, which constitutes the train split. Users are encouraged to create their own validation and test splits as needed for their specific use case.

## Dataset Creation

### Curation Rationale

The primary motivation for creating this dataset was to generate a large-scale, diverse, and structured collection of fictional characters. Such data is valuable for developing conversational AI that can adopt and maintain a consistent persona over long interactions, and for creating derived datasets such as natural conversation datasets.

### Source Data

This is a synthetically generated dataset. It was not derived from any pre-existing corpus of human-written text, but was created through a programmatic generation pipeline.

#### Generation Process

The personas were generated using the following pipeline:

1. **Component Seeding**: The process starts with a persona_components.json file containing weighted lists of professions, life_contexts, traits, and chatting_quirks.
2. **Iterative Generation**: The script generates new personas in a loop until it reaches the target number.
3. **Dynamic Prompting**: For each new persona, a unique prompt is constructed by randomly selecting components (e.g., a profession, a life context, several traits).
4. **Modified Iterative Sampling**: To avoid repetitive content, each iteration's prompt includes a different set of recently generated personas as few-shot examples, following the recent [ConvoGen paper](https://huggingface.co/papers/2503.17460), and instructs the model to create something different. In addition (this is the novel part), the script periodically re-seeds its examples from a high-quality initial list to prevent drift.
5. **LLM Generation**: The prompt is sent to an LLM endpoint, which generates the structured persona data.
6. **Similarity Check**: A basic similarity check compares the newly generated persona against its reference examples to discard simple copies or highly similar concepts.
7. **Collection**: Valid and unique personas are added to the final pool, which is saved periodically and at the end of the run.

> [!Note]
> Profession weights have been adjusted to U.S. Bureau of Labor Statistics (BLS) data to give a realistic distribution of professions in the generated personas.

> [!Note]
> Some professions also specify an age range. This prevents extremely improbable cases such as a 19-year-old retiree, a 70-year-old reinventing their career in the tech space, or an 18-year-old PhD student.

## Known Limitations

- **Narrative Depth**: The background and chatting_style descriptions are intentionally brief. They provide a starting point but lack the depth of a fully developed character biography.
- **Generation Patterns**: Despite efforts to ensure novelty, the generation process may fall into subtle patterns or tropes over 5,000 iterations.
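
Steps 1 to 6 of the generation process can be illustrated with a minimal sketch. This is not the authors' script: the component schema, the `age_range` key, the helper names, the 0.8 similarity threshold, and the re-seeding interval are all assumptions made for illustration. The card only says the similarity check is "basic", so `difflib.SequenceMatcher` is used here as a stand-in, and the LLM call itself is left out.

```python
import random
from difflib import SequenceMatcher

# Hypothetical stand-in for persona_components.json; the real schema may differ.
COMPONENTS = {
    "professions": [
        {"name": "registered nurse", "weight": 3.0, "age_range": [22, 65]},
        {"name": "PhD student", "weight": 1.0, "age_range": [23, 35]},
        {"name": "retired teacher", "weight": 1.0, "age_range": [60, 85]},
    ],
    "traits": ["analytical", "witty", "reserved", "curious", "blunt", "warm"],
}


def pick_components(rng):
    """Pick a weighted profession, an age inside its range, and 3-5 traits."""
    professions = COMPONENTS["professions"]
    weights = [p["weight"] for p in professions]
    profession = rng.choices(professions, weights=weights, k=1)[0]
    low, high = profession["age_range"]
    return {
        "profession": profession["name"],
        "age": rng.randint(low, high),
        "traits": rng.sample(COMPONENTS["traits"], k=rng.randint(3, 5)),
    }


def next_examples(recent, seeds, iteration, rng, k=2, reseed_every=50):
    """Choose few-shot examples from recent personas, falling back to the
    seed list every `reseed_every` iterations (or when too few are recent)."""
    use_seeds = iteration % reseed_every == 0 or len(recent) < k
    pool = seeds if use_seeds else recent
    return rng.sample(pool, k=min(k, len(pool)))


def too_similar(candidate, references, threshold=0.8):
    """Flag a candidate whose background nearly duplicates any reference."""
    return any(
        SequenceMatcher(None, candidate["background"], ref["background"]).ratio()
        >= threshold
        for ref in references
    )
```

In a full loop, the output of `pick_components` and `next_examples` would be formatted into a prompt, sent to the model, and the parsed result kept only if `too_similar` returns `False`.
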
## Additional Information

### Code and Seed Data

The generation script and seed data can be found on [GitHub](https://github.com/marcodsn/SPB/tree/2508).

### Licensing Information

This dataset is licensed under the CC BY 4.0 License. The code used to generate the dataset is available under the Apache 2.0 License.

### Citation Information

If you use this dataset in your research, please consider citing it as follows:

```
@misc{marcodsn_SPB,
  title = {Synthetic Persona Bank},
  author = {Marco De Santis},
  year = {2025},
  version = {2508},
  url = {https://huggingface.co/datasets/marcodsn/SPB-2508},
}
```
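
The field descriptions and the advice to create your own validation and test splits can be put into practice with a short sketch like the one below. The function names, the split fractions, and the seed are assumptions, not part of the dataset's tooling. The length and trait-count limits are treated as soft checks, because the sample persona shown in this card has a `chatting_style` longer than 120 characters, which suggests the limits are prompt targets and are not strictly enforced.

```python
import json
import random

# Required fields and their JSON types, taken from the Data Fields list.
REQUIRED = {
    "name": str, "age": int, "traits": list, "background": str,
    "chatting_style": str, "model": str, "id": str,
}


def load_personas(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def validate_persona(persona):
    """Return type errors and soft-limit warnings; an empty list means clean."""
    problems = []
    for field, expected in REQUIRED.items():
        if not isinstance(persona.get(field), expected):
            problems.append(f"{field}: expected {expected.__name__}")
    username = persona.get("username")
    if username is not None and not isinstance(username, str):
        problems.append("username: expected str or null")
    traits = persona.get("traits")
    if isinstance(traits, list) and not 3 <= len(traits) <= 5:
        problems.append("traits: outside the documented 3-5 range")
    for field, limit in (("background", 300), ("chatting_style", 120)):
        value = persona.get(field)
        if isinstance(value, str) and len(value) > limit:
            problems.append(f"{field}: longer than {limit} characters")
    return problems


def split_personas(personas, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle deterministically and carve out validation and test subsets."""
    shuffled = list(personas)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }
```

With the file downloaded locally, `load_personas(open("data.jsonl", encoding="utf-8"))` yields the list to pass to `split_personas`. Fixing the seed keeps the splits reproducible across runs.
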

Provider:
maas
Created:
2025-08-04