
SOC-2508-MULTI

ModelScope Community · updated 2025-12-05 · indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SOC-2508-MULTI
Resource description:
# Dataset Card for Multilingual Synthetic Online Conversations

![Generation Pipeline](SPB_SOC_Pipeline.png)

## Dataset Summary

This dataset contains multilingual translations of the [Synthetic Online Conversations (SOC-2508)](https://huggingface.co/datasets/marcodsn/SOC-2508) dataset. Each conversation from the original dataset has been translated into French, Italian, German, and Spanish, providing over 1,180 synthetically generated, multi-turn online conversations in multiple languages.

The translations were generated using [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) with **vLLM** as the inference backend through Hugging Face Jobs (check [this wonderful blog post](https://danielvanstrien.xyz/posts/2025/hf-jobs/vllm-batch-inference.html)). The translation script is available [on GitHub](https://github.com/marcodsn/SOC/blob/2508/scripts/translation/translate.uv.py).

Each conversation preserves the complete dialogue between two fictional personas, including their detailed backgrounds, relationship dynamics, and conversation context, and is now available in more languages. Special multimedia tags (e.g., `<image>`, `<audio>`) are preserved to maintain the authentic online conversation experience.

> [!Important]
> For details about the original generation process, data fields, and design goals, see the seed dataset page: [Synthetic Online Conversations (SOC-2508)](https://huggingface.co/datasets/marcodsn/SOC-2508).

## Dataset Structure

The dataset consists of a single split where each item is a JSON object representing a complete conversation with multilingual content.

### Data Instances

Each line in the dataset is a JSON object representing a single chat with multilingual fields.
Here is an example of what a chat object looks like:

```json
{
  "chat_id": "4436437d368e4325a7c1c6f7092c2d9e_f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5_1754636647",
  "experience": {
    "persona1": {
      "name": "Elias Vance",
      "username": "quantum_scribe",
      "age": 42,
      "traits": {
        "en": ["analytical", "introspective", "witty", "reserved"],
        "fr": ["analytique", "introspectif", "spirituel", "réservé"]
      },
      "background": {
        "en": "A theoretical physicist who, after a breakthrough, left academia to write science fiction novels from a secluded cabin. He's currently grappling with a severe case of writer's block for his second book.",
        "fr": "Un physicien théoricien qui, après une percée, a quitté le monde académique pour écrire des romans de science-fiction depuis une cabane isolée. Il lutte actuellement contre un grave blocage d'écrivain pour son deuxième livre."
      },
      "chatting_style": {
        "en": "Uses precise language and often employs metaphors from physics. Tends to write in well-structured, complete sentences, even in casual chat.",
        "fr": "Utilise un langage précis et emploie souvent des métaphores de la physique. A tendance à écrire en phrases bien structurées et complètes, même dans une discussion décontractée."
      },
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "4436437d368e4325a7c1c6f7092c2d9e"
    },
    "persona2": {
      "name": "Luna Reyes",
      "username": "StardustSketcher",
      "age": 28,
      "traits": {
        "en": ["creative", "optimistic", "daydreamer", "empathetic"],
        "fr": ["créative", "optimiste", "rêveuse", "empathique"]
      },
      "background": {
        "en": "A freelance digital artist who illustrates children's books and streams her drawing process online. She finds inspiration in mythology and the night sky.",
        "fr": "Une artiste numérique freelance qui illustre des livres pour enfants et diffuse son processus de dessin en ligne. Elle trouve son inspiration dans la mythologie et le ciel nocturne."
      },
      "chatting_style": {
        "en": "Uses a lot of emojis and kaomoji (´。- ᵕ - 。`). Her messages are often short, enthusiastic, and full of creative typos.",
        "fr": "Utilise beaucoup d'emojis et de kaomoji (´。- ᵕ - 。`). Ses messages sont souvent courts, enthousiastes et pleins de fautes de frappe créatives."
      },
      "model": "Qwen3-235B-A22B-Instruct-2507",
      "id": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5"
    },
    "relationship": {
      "en": "Strangers who met in a 'Vintage Sci-Fi Book Club' Discord server.",
      "fr": "Des inconnus qui se sont rencontrés dans un serveur Discord 'Club de Livres de Science-Fiction Vintage'."
    },
    "situation": {
      "en": "Elias posted a message asking for recommendations to overcome writer's block, and Luna, a fellow member, decided to DM him directly to offer some creative, non-traditional advice.",
      "fr": "Elias a posté un message demandant des recommandations pour surmonter le blocage de l'écrivain, et Luna, un membre du groupe, a décidé de lui envoyer un message privé pour offrir des conseils créatifs et non conventionnels."
    },
    "topic": {
      "en": "I saw your post in the #writing-woes channel and had a few weird ideas that might help! Mind if I share?",
      "fr": "J'ai vu ton post dans le canal #writing-woes et j'ai eu quelques idées bizarres qui pourraient t'aider ! Ça te dérange si je les partage ?"
    },
    "id": "c1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6"
  },
  "chat_parts": [
    {
      "sender": "f8e1b2a3c4d5e6f7g8h9i0j1k2l3m4n5",
      "messages": {
        "en": [
          "Hiii Elias! Saw your post in #writing-woes. I know the feeling (art block is the wooooorst 😭).",
          "Had a few maybe-weird ideas if you're open to them? ✨"
        ],
        "fr": [
          "Salut Elias ! J'ai vu ton post dans #writing-woes. Je connais ce sentiment (le blocage artistique c'est le piiiiire 😭).",
          "J'ai eu quelques idées peut-être bizarres si tu es ouvert à ça ? ✨"
        ]
      }
    },
    {
      "sender": "4436437d368e4325a7c1c6f7092c2d9e",
      "messages": {
        "en": [
          "Hello, Luna. I appreciate the outreach. At this point, I am receptive to any and all suggestions, regardless of their position on the conventionality spectrum."
        ],
        "fr": [
          "Bonjour, Luna. J'apprécie ta démarche. À ce stade, je suis réceptif à toutes les suggestions, peu importe leur position sur le spectre de la conventionnalité."
        ]
      }
    }
  ],
  "model": "Qwen3-235B-A22B-Instruct-2507"
}
```

### Data Fields

Each JSON object contains the same structure as the original SOC-2508 dataset, but with multilingual content:

- **chat_id** (string): A unique identifier for the conversation.
- **experience** (object): An object containing the full multilingual context for the conversation.
  - **persona1** & **persona2** (object): The complete persona objects with multilingual fields:
    - **traits** (object): Character traits in multiple languages
    - **background** (object): Character background in multiple languages
    - **chatting_style** (object): Communication style descriptions in multiple languages
  - **relationship** (object): Relationship description in multiple languages
  - **situation** (object): Scenario description in multiple languages
  - **topic** (object): Opening line in multiple languages
  - **id** (string): A unique identifier for the experience object itself.
- **chat_parts** (list[object]): A list of conversation turns with multilingual messages.
  - **sender** (string): The ID of the persona who sent the messages in this turn.
  - **messages** (object): Messages organized by language code, preserving special XML-like tags.
- **model** (string): The original model used to generate the English conversation.

### Language Codes

This dataset uses ISO 639-1 language codes:

- **en**: English (original)
- **fr**: Français (French)
- **it**: Italiano (Italian)
- **de**: Deutsch (German)
- **es**: Español (Spanish)

### Data Splits

The dataset is provided as a single `train` split. Users are encouraged to create their own validation and test splits as needed for their specific use cases.

## Dataset Creation

### Curation Rationale

This multilingual version was created to extend the reach and utility of the original SOC-2508 dataset across different language communities.
By providing high-quality translations that preserve the nuanced persona-based dialogue structure, this dataset enables research and development of multilingual conversational AI systems that can maintain persona consistency and natural dialogue flow across languages.

### Source Data

This dataset is built upon the synthetically generated [SOC-2508](https://huggingface.co/datasets/marcodsn/SOC-2508) dataset. The original English conversations were translated using automated translation methods while preserving the structure and special formatting elements.

### Translation Process

The translations were generated using the following pipeline:

1. **Model Selection**: **google/gemma-3n-E4B-it** was selected as the translation model for its strong balance of multilingual capability and efficiency.
2. **Infrastructure**: **vLLM** served as the inference backend, deployed through Hugging Face Jobs for efficient (and super fast!) batch processing.
3. **Field-by-Field Translation**: Each multilingual field (persona traits, backgrounds, chat messages, etc.) was translated individually to prevent formatting errors with the small translation model.
4. **Special Tag Preservation**: XML-like tags (`<audio><audio/>`, `<image><image/>`, `<delay/>`, etc.) were preserved in their original form to maintain the multimedia conversation experience across languages.
5. **Quality Assurance**: Post-processing steps ensured structural integrity and consistent formatting across all language versions.
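
Because the special tags are meant to survive translation unchanged, a reader may want to spot-check that property on their own copy of the data. The following is a minimal sketch, not the author's post-processing code: the regular expression, the helper names `extract_tags` and `find_tag_mismatches`, and the sample record are all assumptions made for illustration. It only relies on the `chat_parts` / `messages` layout described in this card.

```python
import re
from collections import Counter

# Assumed pattern for XML-like tags such as <image>, <image/>, or <delay/>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][\w-]*\s*/?>")


def extract_tags(text: str) -> Counter:
    """Count every XML-like tag that appears in a message."""
    return Counter(TAG_PATTERN.findall(text))


def find_tag_mismatches(chat: dict, source: str = "en", target: str = "fr") -> list:
    """Return (turn_index, message_index) pairs whose tags differ between languages.

    A turn whose message counts differ is reported as (turn_index, None).
    """
    mismatches = []
    for turn_index, part in enumerate(chat["chat_parts"]):
        source_msgs = part["messages"].get(source, [])
        target_msgs = part["messages"].get(target, [])
        if len(source_msgs) != len(target_msgs):
            mismatches.append((turn_index, None))
            continue
        for message_index, (src, tgt) in enumerate(zip(source_msgs, target_msgs)):
            if extract_tags(src) != extract_tags(tgt):
                mismatches.append((turn_index, message_index))
    return mismatches


# Hypothetical record, written only to exercise the check.
sample_chat = {
    "chat_parts": [
        {
            "sender": "persona-a",
            "messages": {
                "en": ["Look <image>cat<image/>", "hi <delay/>"],
                "fr": ["Regarde <image>chat<image/>", "salut"],
            },
        }
    ]
}

print(find_tag_mismatches(sample_chat))
```

In the sample, the second French message has lost its `<delay/>` tag, so that message is the one reported. Comparing tag counts rather than tag positions is a deliberate simplification, since word order legitimately changes between languages.
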

## Usage

```python
from datasets import load_dataset

# The dataset ships a single "train" split.
dataset = load_dataset("marcodsn/SOC-2508-MULTI", split="train")

# Pick one conversation; each record holds all languages side by side.
chat = dataset[0]

# Access English version
english_background = chat["experience"]["persona1"]["background"]["en"]

# Access Italian version
italian_background = chat["experience"]["persona1"]["background"]["it"]

# Access multilingual messages (chat_parts is a list of turns, so index a turn first)
first_turn = chat["chat_parts"][0]
english_messages = first_turn["messages"]["en"]
italian_messages = first_turn["messages"]["it"]
```

## Known Limitations

- **Translation Quality**: While generated using a capable model, automated translations may not capture all linguistic nuances, cultural references, or idiomatic expressions perfectly. Furthermore, in this revision the model did not have full conversation context, so some translated messages may sound off.
- **Inherited Limitations**: All limitations from the original SOC-2508 dataset apply, including its synthetic nature and potential biases.
- **Cultural Adaptation**: Translations are linguistic rather than cultural adaptations, so some references may not translate meaningfully across cultural contexts.

## Additional Information

### Original Dataset

This multilingual version is based on the [SOC-2508](https://huggingface.co/datasets/marcodsn/SOC-2508) dataset. Please refer to the original dataset card for detailed information about the generation methodology and underlying persona bank.

### Licensing Information

This dataset is licensed under the CC BY 4.0 License, maintaining consistency with the original dataset.

### Citation Information

If you use this dataset in your research, please consider citing it as follows:

```bibtex
@misc{marcodsn_2025_SOC2508_MULTI,
  title = {Multilingual Synthetic Online Conversations},
  author = {Marco De Santis},
  year = {2025},
  month = {August},
  url = {https://huggingface.co/datasets/marcodsn/SOC-2508-MULTI},
}
```

Provider: maas
Created: 2025-08-10