SOC-2508

Name: SOC-2508
Creator: maas
Published: 2025-12-05 16:44:58
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-08-09.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/SOC-2508

Description:

# Dataset Card for Synthetic Online Conversations

![Generation Pipeline](SPB_SOC_Pipeline.png)

## Dataset Summary

This dataset contains over 1,180 synthetically generated, multi-turn online conversations. Each conversation is a complete dialogue between two fictional personas drawn from the [Synthetic Persona Bank (SPB-2508)](https://huggingface.co/datasets/marcodsn/SPB-2508) dataset.

The dataset was created using a multi-stage programmatic pipeline (inspired by [ConvoGen](https://huggingface.co/papers/2503.17460)) driven by a large language model ([Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)). The generation process was guided by detailed instructions to produce natural, context-aware, and stylistically consistent dialogues with human-like imperfections, realistic conflict, and simulated multimedia elements.

Unlike typical synthetic chats, these conversations explicitly model real-time online behavior: participants may send multiple consecutive messages per turn, include human-like delays between replies, and “attach” simulated multimedia using lightweight XML-like tags (e.g., `<image>`, `<gif>`, `<audio>`, `<video>`, `<delay>`, `<end/>`). This makes the conversations feel more like actual messaging app threads rather than single-turn exchanges with a chatbot.

You can visualize the generated conversations using the [SOC Visualizer](https://huggingface.co/spaces/marcodsn/SOC_Visualizer) HF Space.

## Dataset Structure

The dataset consists of a single JSONL file where each line is a JSON object representing a complete conversation.

### Data Instances

Each line in the dataset is a JSON object representing a single chat. Here is an example of what a chat object looks like:

```json
{
  "chat_id": "4436437d368e4325a7c1c6f7092c2d9e_f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5_1754636647",
  "experience": {
    "persona1": {
      "name": "Elias Vance",
      "username": "quantum_scribe",
      "age": 42,
      "traits": ["analytical", "introspective", "witty", "reserved"],
      "background": "A theoretical physicist who, after a breakthrough, left academia to write science fiction novels from a secluded cabin. He's currently grappling with a severe case of writer's block for his second book.",
      "chatting_style": "Uses precise language and often employs metaphors from physics. Tends to write in well-structured, complete sentences, even in casual chat.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "4436437d368e4325a7c1c6f7092c2d9e"
    },
    "persona2": {
      "name": "Luna Reyes",
      "username": "StardustSketcher",
      "age": 28,
      "traits": ["creative", "optimistic", "daydreamer", "empathetic"],
      "background": "A freelance digital artist who illustrates children's books and streams her drawing process online. She finds inspiration in mythology and the night sky.",
      "chatting_style": "Uses a lot of emojis and kaomoji (´｡• ᵕ •｡`). Her messages are often short, enthusiastic, and full of creative typos.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5"
    },
    "relationship": "Strangers who met in a 'Vintage Sci-Fi Book Club' Discord server.",
    "situation": "Elias posted a message asking for recommendations to overcome writer's block, and Luna, a fellow member, decided to DM him directly to offer some creative, non-traditional advice.",
    "topic": "I saw your post in the #writing-woes channel and had a few weird ideas that might help! Mind if I share?",
    "id": "c1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6"
  },
  "chat_parts": [
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Hiii Elias! Saw your post in #writing-woes. I know the feeling (art block is the wooooorst 😭).",
        "Had a few maybe-weird ideas if you're open to them? ✨"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "<delay minutes=\"5\"/>",
        "Hello, Luna. I appreciate the outreach. At this point, I am receptive to any and all suggestions, regardless of their position on the conventionality spectrum."
      ]
    },
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Awesome! Okay, so forget writing. Just for a day. Go outside tonight and just... look at the stars. No pressure, just observe.",
        "Like you're cataloging them for a galactic library. What do they make you *feel*?",
        "<gif>animated gif of twinkling stars from a 90s anime</gif>"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "An interesting proposition. A purely observational, non-analytical exercise. It has a certain... elegance. I will attempt it.",
        "Thank you. <end/>"
      ]
    }
  ],
  "model": "Qwen3-235B-A22B-Instruct-2507"
}
```

### Data Fields

Each JSON object contains the following fields:

- **chat_id** (string): A unique identifier for the conversation.
- **experience** (object): An object containing the full context for the conversation.
  - **persona1** & **persona2** (object): The complete persona objects from the SPB-2508 dataset for the two participants.
  - **relationship** (string): A brief description of how the two personas know each other.
  - **situation** (string): The specific online context or reason for the conversation starting.
  - **topic** (string): The opening line or subject that kicks off the dialogue.
  - **id** (string): A unique identifier for the experience object itself.
- **chat_parts** (list[object]): A list of objects, where each object represents one turn in the conversation.
  - **sender** (string): The ID of the persona who sent the messages in this turn.
  - **messages** (list[string]): A list of one or more messages sent by the persona in this turn. Can include special XML-like tags.
- **model** (string): The model used to generate the conversation.
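To make the field layout concrete, here is a minimal Python sketch of reading one conversation and separating plain text from the XML-like tags. It is not part of the dataset's tooling: the helper names (`parse_message`, `render_chat`, `iter_chats`) are hypothetical, and the tag grammar is inferred only from the example record above (paired media tags with a text description, a self-closing `<delay>` with a single integer attribute, and a self-closing `<end/>`). Real records may contain variants this pattern does not cover.

```python
import json
import re

# Assumed tag grammar, inferred from the example record only:
#   <gif>description</gif>, <delay minutes="5"/>, <end/>
TAG_RE = re.compile(
    r'<(?P<name>image|gif|audio|video)>(?P<body>.*?)</(?P=name)>'
    r'|<delay\s+(?P<unit>\w+)="(?P<amount>\d+)"\s*/>'
    r'|(?P<end><end\s*/>)',
    re.DOTALL,
)


def parse_message(message):
    """Split one message string into (plain text, list of tag dicts)."""
    tags = []
    for m in TAG_RE.finditer(message):
        if m.group("name"):
            # Media tag: keep its type and the textual description it wraps.
            tags.append({"type": m.group("name"),
                         "description": m.group("body").strip()})
        elif m.group("unit"):
            # Delay tag: the attribute name is treated as the time unit.
            tags.append({"type": "delay",
                         "unit": m.group("unit"),
                         "amount": int(m.group("amount"))})
        else:
            tags.append({"type": "end"})
    text = TAG_RE.sub("", message).strip()
    return text, tags


def render_chat(chat):
    """Return (speaker name, text, tags) triples for one conversation object."""
    exp = chat["experience"]
    # Map persona IDs (used in chat_parts[].sender) to display names.
    names = {exp[k]["id"]: exp[k]["name"] for k in ("persona1", "persona2")}
    rows = []
    for part in chat["chat_parts"]:
        speaker = names.get(part["sender"], part["sender"])
        for message in part["messages"]:
            text, tags = parse_message(message)
            rows.append((speaker, text, tags))
    return rows


def iter_chats(path):
    """Yield one conversation per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

With a local copy of the JSONL file (the path is whatever you saved it as), `for chat in iter_chats(path): rows = render_chat(chat)` walks the whole dataset. Keeping tags as structured dicts rather than stripping them lets downstream code decide whether to honor delays, render media placeholders, or truncate at `<end/>`.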
### Data Splits

The dataset is provided as a single file, which constitutes the `train` split. Users are encouraged to create their own validation and test splits as needed for their specific use cases.

## Dataset Creation

### Curation Rationale

This dataset was created to address the need for large-scale, high-quality conversational data that goes beyond simple question-answering. The goal was to generate dialogues that exhibit deep persona consistency, natural topic progression, and the messy, imperfect nature of real online chats. By building directly on the structured `SPB-2508` persona bank, we ensure each conversation is grounded in a rich, pre-defined context.

### Source Data

This is a synthetically generated dataset. Its primary source is the `marcodsn/SPB-2508` dataset, which provides the character personas. The conversational scenarios and dialogue were generated through a programmatic, multi-stage pipeline.

### Generation Process

The conversations were generated using a three-stage pipeline:

1. **Stage 1: Experience Generation**:
   - Two personas were selected from the `SPB-2508` pool, with a weighting system to favor pairing personas of similar age, promoting more plausible interactions.
   - A relationship context (e.g., "old friends from college," "strangers in a gaming lobby") was dynamically constructed from seed components.
   - These elements were fed to the LLM, which generated a unique `situation` (the reason for the chat) and a starting `topic` (the opening line). Few-shot examples were used to encourage novelty.
2. **Stage 2: Conversational Rollout**:
   - Each generated "experience" served as a prompt for a new conversation.
   - The LLM generated the dialogue turn-by-turn, alternating between the two personas.
   - The prompt for each turn included the full persona details, the initial scenario, and the entire chat history up to that point.
   - The LLM was given a rich set of instructions to encourage realism, including:
     - **Human Imperfection**: Allowing for typos, topic drift, and variable effort in replies.
     - **Realistic Conflict**: Guiding the model to handle disagreements without immediate, clean resolutions.
     - **Special Tags**: Using XML-like tags to simulate online features: `<image>`, `<gif>`, `<audio>`, `<video>` for multimedia; `<delay>` to simulate response times; and `<end/>` to signal a natural conclusion to the chat.
3. **Stage 3: Post-Processing and Cleaning**:
   - The raw generated chats were collected and merged.
   - A cleaning script removed duplicates, filtered out conversations that were too short (fewer than two turns), and scrubbed artifacts like model-inserted speaker names (e.g., "Elias Vance: Hello").
   - The final, cleaned dataset was shuffled to ensure random distribution.

## Known Limitations

- **Synthetic Nature**: While designed for realism, the dialogues are synthetic and may not capture the full chaotic unpredictability of genuine human interaction.
- **Inherited Bias**: Any biases, stereotypes, or patterns present in the source `SPB-2508` dataset will be inherited and potentially amplified in these conversations.
- **Tag Frequency**: The use of special tags (`<image>`, `<delay>`, etc.) is not uniform across all conversations, as their inclusion was left to the model's discretion during generation.
- **Conversation Endings**: The `<end/>` tag provides a clear signal but might lead to some conversations concluding more formulaically than they would in the wild.
- **Instruction Following**: The LLM we used is not perfect and various problems derived from poor instruction following have been found; furthermore, the `<end/>` tag is often used prematurely. We will try to solve these issues in a future release.

## Additional Information

### Code and Seed Data

The generation scripts and seed data can be found on [GitHub](https://github.com/marcodsn/SOC/tree/2508).
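The linked repository holds the actual scripts. Purely as an illustration, the Stage 3 cleaning steps listed under Generation Process (scrubbing speaker-name prefixes, dropping chats with fewer than two turns, removing duplicates, shuffling) might be sketched in Python as follows. The function names, the definition of a duplicate (identical `chat_parts` after scrubbing), and the fixed shuffle seed are assumptions made for this sketch and are not taken from the repository.

```python
import json
import random
import re


def strip_speaker_prefix(message, names):
    """Remove a leading 'Name: ' label if it matches one of the given names."""
    for name in names:
        pattern = r'^\s*' + re.escape(name) + r'\s*:\s*'
        message = re.sub(pattern, "", message, count=1)
    return message


def clean_chats(chats, min_turns=2, seed=0):
    """Scrub speaker labels, drop short chats and duplicates, then shuffle.

    Assumption: two chats count as duplicates when their scrubbed
    chat_parts are identical. The original script may use another rule.
    """
    seen = set()
    cleaned = []
    for chat in chats:
        exp = chat["experience"]
        names = [exp["persona1"]["name"], exp["persona2"]["name"]]
        parts = [
            {
                "sender": part["sender"],
                "messages": [strip_speaker_prefix(m, names)
                             for m in part["messages"]],
            }
            for part in chat["chat_parts"]
        ]
        # Filter: the card says chats with fewer than two turns were removed.
        if len(parts) < min_turns:
            continue
        # Deduplicate on a canonical serialization of the scrubbed turns.
        key = json.dumps(parts, sort_keys=True, ensure_ascii=False)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**chat, "chat_parts": parts})
    # Seeded shuffle so the sketch is reproducible.
    random.Random(seed).shuffle(cleaned)
    return cleaned
```

Scrubbing before deduplication means two chats that differ only by an inserted speaker label collapse into one, which seems the more useful behavior for a sketch like this.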
### Licensing Information

This dataset is licensed under the CC BY 4.0 License. The code used to generate the dataset is available under the Apache 2.0 License.

### Citation Information

If you use this dataset in your research, please consider citing it as follows:

```bibtex
@misc{marcodsn_2025_SOC2508,
  title  = {Synthetic Online Conversations},
  author = {Marco De Santis},
  year   = {2025},
  month  = {August},
  url    = {https://huggingface.co/datasets/marcodsn/SOC-2508},
}
```


Provider: maas

Created: 2025-08-06
