SOC-2508
Source: ModelScope community (updated 2025-12-05, indexed 2025-08-09)
Download link: https://modelscope.cn/datasets/AI-ModelScope/SOC-2508
# Dataset Card for Synthetic Online Conversations

## Dataset Summary
This dataset contains over 1,180 synthetically generated, multi-turn online conversations. Each conversation is a complete dialogue between two fictional personas drawn from the [Synthetic Persona Bank (SPB-2508)](https://huggingface.co/datasets/marcodsn/SPB-2508) dataset.
The dataset was created using a multi-stage programmatic pipeline (inspired by [ConvoGen](https://huggingface.co/papers/2503.17460)) driven by a large language model ([Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)). The generation process was guided by detailed instructions to produce natural, context-aware, and stylistically consistent dialogues with human-like imperfections, realistic conflict, and simulated multimedia elements.
Unlike typical synthetic chats, these conversations explicitly model real-time online behavior: participants may send multiple consecutive messages per turn, include human-like delays between replies, and “attach” simulated multimedia using lightweight XML-like tags (e.g., `<image>`, `<gif>`, `<audio>`, `<video>`, `<delay>`, `<end/>`). This makes the conversations feel more like actual messaging app threads rather than single-turn exchanges with a chatbot.
You can visualize the generated conversations using the [SOC Visualizer](https://huggingface.co/spaces/marcodsn/SOC_Visualizer) HF Space.
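The tag conventions described above lend themselves to simple pattern matching. The following is a minimal sketch, not part of the dataset's own tooling: the function name `split_message` and the token format are assumptions made for illustration, and the regular expression covers only the tag shapes shown in this card (paired media tags, a self-closing `<delay .../>` with attributes, and `<end/>`).

```python
import re

# Tag shapes taken from this card: paired media tags, self-closing <delay .../> and <end/>.
TAG_RE = re.compile(
    r'<(?P<media>image|gif|audio|video)>(?P<desc>.*?)</(?P=media)>'
    r'|<delay(?P<attrs>[^>]*?)/>'
    r'|<end\s*/>',
    re.DOTALL,
)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def split_message(message):
    """Split one message string into ("text", str) tokens and tag tokens, in order."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(message):
        text = message[pos:m.start()].strip()
        if text:
            tokens.append(("text", text))
        if m.group("media"):
            # Media tags carry a free-text description of the attachment.
            tokens.append((m.group("media"), m.group("desc").strip()))
        elif m.group("attrs") is not None:
            # Delay attributes are kept as strings, e.g. {"minutes": "5"}.
            tokens.append(("delay", dict(ATTR_RE.findall(m.group("attrs")))))
        else:
            tokens.append(("end", None))
        pos = m.end()
    tail = message[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens
```

Because tag usage was left to the model, real messages may contain malformed or unexpected tags; anything the pattern does not recognize is simply kept as text.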
## Dataset Structure
The dataset consists of a single JSONL file where each line is a JSON object representing a complete conversation.
### Data Instances
Each line in the dataset is a JSON object representing a single chat. Here is an example of what a chat object looks like:
```json
{
  "chat_id": "4436437d368e4325a7c1c6f7092c2d9e_f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5_1754636647",
  "experience": {
    "persona1": {
      "name": "Elias Vance",
      "username": "quantum_scribe",
      "age": 42,
      "traits": ["analytical", "introspective", "witty", "reserved"],
      "background": "A theoretical physicist who, after a breakthrough, left academia to write science fiction novels from a secluded cabin. He's currently grappling with a severe case of writer's block for his second book.",
      "chatting_style": "Uses precise language and often employs metaphors from physics. Tends to write in well-structured, complete sentences, even in casual chat.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "4436437d368e4325a7c1c6f7092c2d9e"
    },
    "persona2": {
      "name": "Luna Reyes",
      "username": "StardustSketcher",
      "age": 28,
      "traits": ["creative", "optimistic", "daydreamer", "empathetic"],
      "background": "A freelance digital artist who illustrates children's books and streams her drawing process online. She finds inspiration in mythology and the night sky.",
      "chatting_style": "Uses a lot of emojis and kaomoji (´。• ᵕ •。`). Her messages are often short, enthusiastic, and full of creative typos.",
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5"
    },
    "relationship": "Strangers who met in a 'Vintage Sci-Fi Book Club' Discord server.",
    "situation": "Elias posted a message asking for recommendations to overcome writer's block, and Luna, a fellow member, decided to DM him directly to offer some creative, non-traditional advice.",
    "topic": "I saw your post in the #writing-woes channel and had a few weird ideas that might help! Mind if I share?",
    "id": "c1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6"
  },
  "chat_parts": [
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Hiii Elias! Saw your post in #writing-woes. I know the feeling (art block is the wooooorst 😭).",
        "Had a few maybe-weird ideas if you're open to them? ✨"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "<delay minutes=\"5\"/>",
        "Hello, Luna. I appreciate the outreach. At this point, I am receptive to any and all suggestions, regardless of their position on the conventionality spectrum."
      ]
    },
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": [
        "Awesome! Okay, so forget writing. Just for a day. Go outside tonight and just... look at the stars. No pressure, just observe.",
        "Like you're cataloging them for a galactic library. What do they make you *feel*?",
        "<gif>animated gif of twinkling stars from a 90s anime</gif>"
      ]
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": [
        "An interesting proposition. A purely observational, non-analytical exercise. It has a certain... elegance. I will attempt it.",
        "Thank you. <end/>"
      ]
    }
  ],
  "model": "Qwen3-235B-A22B-Instruct-2507"
}
```
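Since each line of the file is an independent JSON object, the dataset can be read with the standard library alone. The snippet below is a minimal sketch; the helper name `read_chats` is an assumption, and the local file name is not stated in this card, so any path passed to it is the reader's own.

```python
import json


def read_chats(path):
    """Yield one conversation dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Reading lazily with a generator keeps memory use flat, which matters little at roughly 1,180 conversations but costs nothing. Wrapping the call in `list(...)` gives random access when needed.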
### Data Fields
Each JSON object contains the following fields:
- **chat_id** (string): A unique identifier for the conversation.
- **experience** (object): An object containing the full context for the conversation.
  - **persona1** & **persona2** (object): The complete persona objects from the SPB-2508 dataset for the two participants.
  - **relationship** (string): A brief description of how the two personas know each other.
  - **situation** (string): The specific online context or reason for the conversation starting.
  - **topic** (string): The opening line or subject that kicks off the dialogue.
  - **id** (string): A unique identifier for the experience object itself.
- **chat_parts** (list[object]): A list of objects, where each object represents one turn in the conversation.
  - **sender** (string): The ID of the persona who sent the messages in this turn.
  - **messages** (list[string]): A list of one or more messages sent by the persona in this turn. Can include special XML-like tags.
- **model** (string): The model used to generate the conversation.
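Because `sender` holds a persona ID rather than a name, reading a conversation usually starts with a lookup against the two persona objects. A minimal sketch of that step, using only the fields listed above (the function name `flatten_chat` is an assumption):

```python
def flatten_chat(chat):
    """Return (speaker_name, message) pairs, in order, for one conversation dict."""
    experience = chat["experience"]
    # Senders are stored as persona IDs; map them to display names.
    names = {
        experience[key]["id"]: experience[key]["name"]
        for key in ("persona1", "persona2")
    }
    rows = []
    for part in chat["chat_parts"]:
        # Fall back to the raw ID if a sender matches neither persona.
        speaker = names.get(part["sender"], part["sender"])
        for message in part["messages"]:
            rows.append((speaker, message))
    return rows
```

Applied to the example instance, the first pair would be Luna Reyes with her opening greeting, and the special tags would still appear verbatim inside the message strings.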
### Data Splits
The dataset is provided as a single file, which constitutes the `train` split. Users are encouraged to create their own validation and test splits as needed for their specific use cases.
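One way to derive reproducible splits is to hash each `chat_id` into a fixed bucket, so a conversation always lands in the same split regardless of file order. This is a sketch under assumptions: the function name `assign_split` and the default 80/10/10 proportions are illustrative choices, not something the dataset prescribes.

```python
import hashlib


def assign_split(chat_id, val_fraction=0.1, test_fraction=0.1):
    """Map a chat_id to "train", "validation", or "test" deterministically."""
    # sha256 is stable across runs, unlike Python's salted built-in hash().
    digest = hashlib.sha256(chat_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"
```

If the same persona turns out to appear in several conversations, splitting on `chat_id` can place that persona on both sides of a split; hashing a persona ID instead would be one way to avoid that, at the cost of less even split sizes.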
## Dataset Creation
### Curation Rationale
This dataset was created to address the need for large-scale, high-quality conversational data that goes beyond simple question-answering. The goal was to generate dialogues that exhibit deep persona consistency, natural topic progression, and the messy, imperfect nature of real online chats. By building directly on the structured `SPB-2508` persona bank, we ensure each conversation is grounded in a rich, pre-defined context.
### Source Data
This is a synthetically generated dataset. Its primary source is the `marcodsn/SPB-2508` dataset, which provides the character personas. The conversational scenarios and dialogue were generated through a programmatic, multi-stage pipeline.
### Generation Process
The conversations were generated using a three-stage pipeline:
1. **Stage 1: Experience Generation**:
   - Two personas were selected from the `SPB-2508` pool, with a weighting system to favor pairing personas of similar age, promoting more plausible interactions.
   - A relationship context (e.g., "old friends from college," "strangers in a gaming lobby") was dynamically constructed from seed components.
   - These elements were fed to the LLM, which generated a unique `situation` (the reason for the chat) and a starting `topic` (the opening line). Few-shot examples were used to encourage novelty.
2. **Stage 2: Conversational Rollout**:
   - Each generated "experience" served as a prompt for a new conversation.
   - The LLM generated the dialogue turn-by-turn, alternating between the two personas.
   - The prompt for each turn included the full persona details, the initial scenario, and the entire chat history up to that point.
   - The LLM was given a rich set of instructions to encourage realism, including:
      - **Human Imperfection**: Allowing for typos, topic drift, and variable effort in replies.
      - **Realistic Conflict**: Guiding the model to handle disagreements without immediate, clean resolutions.
      - **Special Tags**: Using XML-like tags to simulate online features: `<image>`, `<gif>`, `<audio>`, `<video>` for multimedia; `<delay>` to simulate response times; and `<end/>` to signal a natural conclusion to the chat.
3. **Stage 3: Post-Processing and Cleaning**:
   - The raw generated chats were collected and merged.
   - A cleaning script removed duplicates, filtered out conversations that were too short (fewer than two turns), and scrubbed artifacts like model-inserted speaker names (e.g., "Elias Vance: Hello").
   - The final, cleaned dataset was shuffled to ensure random distribution.
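The Stage 3 operations can be sketched as follows. This is an independent illustration, not the project's cleaning script (which lives in the linked repository). Several details are assumptions: duplicates are taken to mean conversations with identical `chat_parts`, speaker prefixes are stripped only when they match one of the two persona names, and the name `clean_chats` and the `seed` parameter are hypothetical.

```python
import json
import random
import re


def clean_chats(chats, min_turns=2, seed=0):
    """Deduplicate, drop short chats, strip "Name: " prefixes, then shuffle."""
    seen = set()
    cleaned = []
    for chat in chats:
        parts = chat["chat_parts"]
        if len(parts) < min_turns:
            continue  # "too short" per the card: fewer than two turns
        # Assumption: two chats are duplicates when their turns are identical.
        key = json.dumps(parts, sort_keys=True, ensure_ascii=False)
        if key in seen:
            continue
        seen.add(key)
        names = [chat["experience"][p]["name"] for p in ("persona1", "persona2")]
        # Matches a leading "Elias Vance: " style prefix for either persona.
        prefix = re.compile(r"^(?:%s)\s*:\s*" % "|".join(map(re.escape, names)))
        new_parts = [
            {**part, "messages": [prefix.sub("", m) for m in part["messages"]]}
            for part in parts
        ]
        cleaned.append({**chat, "chat_parts": new_parts})
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(cleaned)
    return cleaned
```

The function builds new dicts rather than mutating its input, so the raw generations stay available for comparison.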
## Known Limitations
- **Synthetic Nature**: While designed for realism, the dialogues are synthetic and may not capture the full chaotic unpredictability of genuine human interaction.
- **Inherited Bias**: Any biases, stereotypes, or patterns present in the source `SPB-2508` dataset will be inherited and potentially amplified in these conversations.
- **Tag Frequency**: The use of special tags (`<image>`, `<delay>`, etc.) is not uniform across all conversations, as their inclusion was left to the model's discretion during generation.
- **Conversation Endings**: The `<end/>` tag provides a clear signal but might lead to some conversations concluding more formulaically than they would in the wild.
- **Instruction Following**: The generating LLM does not always follow its instructions, and several problems stemming from this have been observed; in particular, the `<end/>` tag is often emitted prematurely. We will try to address these issues in a future release.
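Some of these limitations can be measured before training on the data. The sketch below counts tag usage per conversation and flags conversations where `<end/>` is followed by further messages. The function names are assumptions, and the check is purely structural: a conversation that wraps up too soon but ends cleanly on `<end/>` would not be flagged.

```python
import re
from collections import Counter

# Matches opening or self-closing special tags, but not closing tags like </gif>.
TAG_NAME_RE = re.compile(r"<(image|gif|audio|video|delay|end)\b")


def tag_counts(chat):
    """Count occurrences of each special tag in one conversation."""
    counts = Counter()
    for part in chat["chat_parts"]:
        for message in part["messages"]:
            counts.update(TAG_NAME_RE.findall(message))
    return counts


def has_early_end(chat):
    """True if an <end/> tag appears in any message before the final one."""
    messages = [m for part in chat["chat_parts"] for m in part["messages"]]
    return any("<end" in m for m in messages[:-1])
```

Summing `tag_counts` over all conversations gives a quick picture of how unevenly the tags are distributed.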
## Additional Information
### Code and Seed Data
The generation scripts and seed data can be found on [GitHub](https://github.com/marcodsn/SOC/tree/2508).
### Licensing Information
This dataset is licensed under the CC BY 4.0 License.
The code used to generate the dataset is available under the Apache 2.0 License.
### Citation Information
If you use this dataset in your research, please consider citing it as follows:
```bibtex
@misc{marcodsn_2025_SOC2508,
title = {Synthetic Online Conversations},
author = {Marco De Santis},
year = {2025},
month = {August},
url = {https://huggingface.co/datasets/marcodsn/SOC-2508},
}
```
Provider: maas
Created: 2025-08-06



