
SOC-2508

ModelScope community. Updated 2025-12-05; indexed 2025-08-09.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SOC-2508
Resource description:
# Dataset Card for Synthetic Online Conversations

![Generation Pipeline](SPB_SOC_Pipeline.png)

## Dataset Summary

This dataset contains over 1,180 synthetically generated, multi-turn online conversations. Each conversation is a complete dialogue between two fictional personas drawn from the [Synthetic Persona Bank (SPB-2508)](https://huggingface.co/datasets/marcodsn/SPB-2508) dataset.

The dataset was created using a multi-stage programmatic pipeline (inspired by [ConvoGen](https://huggingface.co/papers/2503.17460)) driven by a large language model ([Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)). The generation process was guided by detailed instructions to produce natural, context-aware, and stylistically consistent dialogues with human-like imperfections, realistic conflict, and simulated multimedia elements.

Unlike typical synthetic chats, these conversations explicitly model real-time online behavior: participants may send multiple consecutive messages per turn, include human-like delays between replies, and “attach” simulated multimedia using lightweight XML-like tags (e.g., `<image>`, `<gif>`, `<audio>`, `<video>`, `<delay>`, `<end/>`). This makes the conversations feel more like actual messaging app threads rather than single-turn exchanges with a chatbot.

You can visualize the generated conversations using the [SOC Visualizer](https://huggingface.co/spaces/marcodsn/SOC_Visualizer) HF Space.

## Dataset Structure

The dataset consists of a single JSONL file where each line is a JSON object representing a complete conversation.

### Data Instances

Each line in the dataset is a JSON object representing a single chat.
Here is an example of what a chat object looks like:

```json
{
  "chat_id": "4436437d368e4325a7c1c6f7092c2d9e_f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5_1754636647",
  "experience": {
    "persona1": {
      "name": "Elias Vance",
      "username": "quantum_scribe",
      "age": 42,
      "traits": ["analytical", "introspective", "witty", "reserved"],
      "background": "A theoretical physicist who, after a breakthrough, left academia to write science fiction novels from a secluded cabin. He's currently grappling with a severe case of writer's block for his second book.",
      "chatting_style": "Uses precise language and often employs metaphors from physics. Tends to write in well-structured, complete sentences, even in casual chat.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "4436437d368e4325a7c1c6f7092c2d9e"
    },
    "persona2": {
      "name": "Luna Reyes",
      "username": "StardustSketcher",
      "age": 28,
      "traits": ["creative", "optimistic", "daydreamer", "empathetic"],
      "background": "A freelance digital artist who illustrates children's books and streams her drawing process online. She finds inspiration in mythology and the night sky.",
      "chatting_style": "Uses a lot of emojis and kaomoji (´。• ᵕ •。`). Her messages are often short, enthusiastic, and full of creative typos.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5"
    },
    "relationship": "Strangers who met in a 'Vintage Sci-Fi Book Club' Discord server.",
    "situation": "Elias posted a message asking for recommendations to overcome writer's block, and Luna, a fellow member, decided to DM him directly to offer some creative, non-traditional advice.",
    "topic": "I saw your post in the #writing-woes channel and had a few weird ideas that might help! Mind if I share?",
    "id": "c1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6"
  },
  "chat_parts": [
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Hiii Elias! Saw your post in #writing-woes. I know the feeling (art block is the wooooorst 😭).",
        "Had a few maybe-weird ideas if you're open to them? ✨"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "<delay minutes=\"5\"/>",
        "Hello, Luna. I appreciate the outreach. At this point, I am receptive to any and all suggestions, regardless of their position on the conventionality spectrum."
      ]
    },
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Awesome! Okay, so forget writing. Just for a day. Go outside tonight and just... look at the stars. No pressure, just observe.",
        "Like you're cataloging them for a galactic library. What do they make you *feel*?",
        "<gif>animated gif of twinkling stars from a 90s anime</gif>"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "An interesting proposition. A purely observational, non-analytical exercise. It has a certain... elegance. I will attempt it.",
        "Thank you. <end/>"
      ]
    }
  ],
  "model": "Qwen3-235B-A22B-Instruct-2507"
}
```

### Data Fields

Each JSON object contains the following fields:

- **chat_id** (string): A unique identifier for the conversation.
- **experience** (object): An object containing the full context for the conversation.
    - **persona1** & **persona2** (object): The complete persona objects from the SPB-2508 dataset for the two participants.
    - **relationship** (string): A brief description of how the two personas know each other.
    - **situation** (string): The specific online context or reason for the conversation starting.
    - **topic** (string): The opening line or subject that kicks off the dialogue.
    - **id** (string): A unique identifier for the experience object itself.
- **chat_parts** (list[object]): A list of objects, where each object represents one turn in the conversation.
    - **sender** (string): The ID of the persona who sent the messages in this turn.
    - **messages** (list[string]): A list of one or more messages sent by the persona in this turn. Can include special XML-like tags.
- **model** (string): The model used to generate the conversation.
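
The fields above can be read with nothing beyond the Python standard library. The following is a minimal sketch, not an official loader: the helper names (`load_chats`, `flatten_chat`, `parse_message`) are hypothetical, and the tag patterns are inferred only from the example object shown in this card, so `<delay>` tags with attributes other than `minutes` are an untested assumption and would simply be treated as plain text here.

```python
import json
import re

# Tag shapes inferred from the card's example; the full attribute set used
# across the dataset is an assumption.
DELAY_RE = re.compile(r'<delay\s+minutes="(\d+)"\s*/>')
MEDIA_RE = re.compile(r"<(image|gif|audio|video)>(.*?)</\1>", re.DOTALL)
END_TAG = "<end/>"


def parse_message(text):
    """Classify one string from a `messages` list as delay, media, or text."""
    stripped = text.strip()
    delay = DELAY_RE.fullmatch(stripped)
    if delay:
        return {"type": "delay", "minutes": int(delay.group(1))}
    media = MEDIA_RE.fullmatch(stripped)
    if media:
        return {"type": media.group(1), "description": media.group(2)}
    # Plain text; note whether it carries the end-of-chat marker.
    return {
        "type": "text",
        "text": text.replace(END_TAG, "").strip(),
        "ends_chat": END_TAG in text,
    }


def flatten_chat(chat):
    """Yield (speaker_name, parsed_message) pairs in conversation order."""
    experience = chat["experience"]
    names = {
        experience[key]["id"]: experience[key]["name"]
        for key in ("persona1", "persona2")
    }
    for part in chat["chat_parts"]:
        # Fall back to the raw sender ID if it matches neither persona.
        speaker = names.get(part["sender"], part["sender"])
        for message in part["messages"]:
            yield speaker, parse_message(message)


def load_chats(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Resolving `sender` through the two persona IDs keeps the turn structure intact while making the output readable; the file path passed to `load_chats` depends on where the JSONL file was downloaded.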
### Data Splits

The dataset is provided as a single file, which constitutes the `train` split. Users are encouraged to create their own validation and test splits as needed for their specific use cases.

## Dataset Creation

### Curation Rationale

This dataset was created to address the need for large-scale, high-quality conversational data that goes beyond simple question-answering. The goal was to generate dialogues that exhibit deep persona consistency, natural topic progression, and the messy, imperfect nature of real online chats. By building directly on the structured `SPB-2508` persona bank, we ensure each conversation is grounded in a rich, pre-defined context.

### Source Data

This is a synthetically generated dataset. Its primary source is the `marcodsn/SPB-2508` dataset, which provides the character personas. The conversational scenarios and dialogue were generated through a programmatic, multi-stage pipeline.

### Generation Process

The conversations were generated using a three-stage pipeline:

1. **Stage 1: Experience Generation**:
    - Two personas were selected from the `SPB-2508` pool, with a weighting system to favor pairing personas of similar age, promoting more plausible interactions.
    - A relationship context (e.g., "old friends from college," "strangers in a gaming lobby") was dynamically constructed from seed components.
    - These elements were fed to the LLM, which generated a unique `situation` (the reason for the chat) and a starting `topic` (the opening line). Few-shot examples were used to encourage novelty.
2. **Stage 2: Conversational Rollout**:
    - Each generated "experience" served as a prompt for a new conversation.
    - The LLM generated the dialogue turn-by-turn, alternating between the two personas.
    - The prompt for each turn included the full persona details, the initial scenario, and the entire chat history up to that point.
    - The LLM was given a rich set of instructions to encourage realism, including:
        - **Human Imperfection**: Allowing for typos, topic drift, and variable effort in replies.
        - **Realistic Conflict**: Guiding the model to handle disagreements without immediate, clean resolutions.
        - **Special Tags**: Using XML-like tags to simulate online features: `<image>`, `<gif>`, `<audio>`, `<video>` for multimedia; `<delay>` to simulate response times; and `<end/>` to signal a natural conclusion to the chat.
3. **Stage 3: Post-Processing and Cleaning**:
    - The raw generated chats were collected and merged.
    - A cleaning script removed duplicates, filtered out conversations that were too short (fewer than two turns), and scrubbed artifacts like model-inserted speaker names (e.g., "Elias Vance: Hello").
    - The final, cleaned dataset was shuffled to ensure random distribution.

## Known Limitations

- **Synthetic Nature**: While designed for realism, the dialogues are synthetic and may not capture the full chaotic unpredictability of genuine human interaction.
- **Inherited Bias**: Any biases, stereotypes, or patterns present in the source `SPB-2508` dataset will be inherited and potentially amplified in these conversations.
- **Tag Frequency**: The use of special tags (`<image>`, `<delay>`, etc.) is not uniform across all conversations, as their inclusion was left to the model's discretion during generation.
- **Conversation Endings**: The `<end/>` tag provides a clear signal but might lead to some conversations concluding more formulaically than they would in the wild.
- **Instruction Following**: The LLM we used is not perfect and various problems derived from poor instruction following have been found; furthermore, the `<end/>` tag is often used prematurely. We will try to solve these issues in a future release.

## Additional Information

### Code and Seed Data

The generation scripts and seed data can be found on [GitHub](https://github.com/marcodsn/SOC/tree/2508).
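
For readers who want a feel for the Stage 3 cleaning step without opening the repository, here is a rough sketch of what such a pass could look like. It is an illustration only and not the author's script: the function names, the content-hash deduplication, the `min_turns` and `seed` parameters, and the regular expression for speaker prefixes are all assumptions; only the three behaviors (remove duplicates, drop chats with fewer than two turns, scrub inserted speaker names, then shuffle) come from the description in this card.

```python
import hashlib
import json
import random
import re


def strip_speaker_prefix(message, names):
    """Remove a leading 'Name: ' that the model may have inserted."""
    for name in names:
        prefix = re.compile(r"^\s*" + re.escape(name) + r"\s*:\s*")
        message = prefix.sub("", message, count=1)
    return message


def clean_chat(chat):
    """Return a copy of the chat with speaker-name prefixes scrubbed."""
    experience = chat["experience"]
    names = [experience["persona1"]["name"], experience["persona2"]["name"]]
    parts = [
        {**part, "messages": [strip_speaker_prefix(m, names) for m in part["messages"]]}
        for part in chat["chat_parts"]
    ]
    return {**chat, "chat_parts": parts}


def fingerprint(chat):
    """Hash the dialogue content so that identical chats collide."""
    payload = json.dumps(chat["chat_parts"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def clean_dataset(chats, min_turns=2, seed=0):
    """Scrub, filter short chats, deduplicate, then shuffle (assumed order)."""
    seen, kept = set(), []
    for chat in map(clean_chat, chats):
        if len(chat["chat_parts"]) < min_turns:
            continue  # too short: fewer than `min_turns` turns
        key = fingerprint(chat)
        if key in seen:
            continue  # duplicate dialogue content
        seen.add(key)
        kept.append(chat)
    random.Random(seed).shuffle(kept)  # seeded so the order is reproducible
    return kept
```

Scrubbing before hashing means two chats that differ only by an inserted speaker prefix are treated as duplicates; whether the original script made the same choice is not stated in the card.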
### Licensing Information

This dataset is licensed under the CC BY 4.0 License. The code used to generate the dataset is available under the Apache 2.0 License.

### Citation Information

If you use this dataset in your research, please consider citing it as follows:

```bibtex
@misc{marcodsn_2025_SOC2508,
  title = {Synthetic Online Conversations},
  author = {Marco De Santis},
  year = {2025},
  month = {August},
  url = {https://huggingface.co/datasets/marcodsn/SOC-2508},
}
```

Provider:
maas
Created:
2025-08-06