
m-personas

ModelScope Community. Updated 2026-01-02. Indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/BSC-LT/m-personas
Resource description:
# mPersonas: Multilingual Persona-Driven Conversational Dataset

## Dataset Summary

**mPersonas** is a multilingual open-source dataset with high-quality persona descriptions synthetically generated by DeepSeek-V3-0324. It follows a *persona-driven data synthesis methodology*, similar to [PersonaHub](https://huggingface.co/datasets/proj-persona/PersonaHub).

- **Instances:** 510,000
- **Total tokens:** 173M
  - 28M in personas
  - 145M in conversations (105M in assistant turns)
- **Languages:** 15
- **License:** Apache 2.0

## Methodology

This section describes the step-by-step process for generating the **mPersonas** dataset.

<img src="images/diagram.png" alt="Data Generation Pipeline" width="800"/>

**Figure:** Data generation pipeline from seed documents to processed conversations. Similarity filtering is applied separately for each language.

### Seed Data

Personas are generated from a curated subset of documents from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2), selected for their quality, extensive deduplication, and multilingual cultural diversity. Each language-specific persona is seeded exclusively from documents in the corresponding language subset, ensuring linguistic and cultural relevance.

### Persona Generation

We follow PersonaHub's approach, using both text-to-persona and persona-to-persona generation methods. Personas vary in length and complexity to enhance dataset diversity. Persona-to-persona generation specifically helps surface uncommon but realistic personas. Generation prompts are language-specific, which significantly increases generation success rates compared to multilingual prompts.

### Persona Filtering

The post-processing pipeline includes three filtering stages:

- **Format validation**: Ensuring generated personas follow the required JSON structure.
- **Language detection**: Using [GlotLID](https://huggingface.co/cis-lmu/glotlid) to verify linguistic consistency between generated personas and seed texts. Personas failing this check or scoring below 0.5 in language detection are discarded.
- **Deduplication**:
  - **N-gram Deduplication**: MinHash (shingle size = 1, similarity threshold = 0.7, 128 permutations) and LSH to remove near-duplicates efficiently.
  - **Embedding-based Deduplication**: LaBSE embeddings with a cosine similarity threshold of 0.85 to detect and remove finer-grained duplicates within each language subset.

Persona filtering removed between 4% and 35% of generated content, depending on the language.

<img src="images/UMAP_all-data.png" alt="m-personas distribution" width="500"/>

**Figure:** UMAP projection of the main topics generated in our dataset.

### Conversation Filtering

Final conversations undergo structural validation, including checks for:

- Correct format.
- Strict alternation between 'user' and 'assistant' roles.
- Beginning with 'user' and ending with 'assistant'.

## Supported Tasks and Use Cases

### Direct Uses

This dataset is a foundational resource designed to support the generation of diverse synthetic data. Researchers and developers can incorporate persona descriptions into prompts to enhance diversity in downstream synthetic datasets.

### Out-of-scope Use

While it can be used for LLM training or fine-tuning, the dataset was primarily designed to support diverse synthetic data generation, rather than to directly optimize model performance.

## Dataset Structure

Each example is structured as follows:

```json
{
  "id": "<unique_alphanumerical_id>",
  "persona": "<persona_description>",
  "parent-persona-id": "...", // null if not using persona-to-persona
  "prompt": "",
  "messages": [
    { "role": "user", "content": "" },
    { "role": "assistant", "content": "" },
    ...
  ],
  "lang": "<ISO639-code>",
  "type": "<general/reasoning>",
  "theme": "",
  "source_text": "<original_fineweb_text>" // null if persona-to-persona
}
```

## How to Use

```python
from datasets import load_dataset

# Load the Spanish ("es") subset of the dataset
personas_es = load_dataset("BSC-LT/m-personas", "es")
```

## Bias, Risks, and Limitations

All datasets inevitably carry inherent biases, and this one is no exception. Because it was synthetically generated using a language model, its content is conditioned by the biases present in that model. However, by grounding the synthetic generations in seed data, we aim to anchor the outputs to the distribution and characteristics of the original base dataset, thus limiting the risk of compounding model-specific biases.

A superficial analysis shows that the generated personas present multiple societal and cultural biases, with strong gender and gender-occupation biases among the most prominent. Although this is deeply concerning, an examination of the resulting conversations shows that these biases are not clearly transferred to the instruction dataset, which alleviates the issue for downstream uses.

We highlight that our analyses of these biases are by no means exhaustive, and we fully acknowledge the inherent risks associated with releasing datasets that include biases. We urge researchers and developers to use the dataset responsibly, to take these biases into account, and to perform safety testing specific to their application.

## License

This dataset is released under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgements

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
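
The three structural checks listed under Conversation Filtering can be illustrated with a small validator over the `messages` field shown in Dataset Structure. This is a minimal sketch, not the authors' actual filtering code: the function name `is_valid_conversation` is hypothetical, and "correct format" is assumed here to mean a non-empty list of objects with string `role` and `content` fields.

```python
from typing import Any


def is_valid_conversation(messages: Any) -> bool:
    """Return True if `messages` passes the three structural checks.

    Assumption: "correct format" means a non-empty list of dicts,
    each with string-valued "role" and "content" keys.
    """
    # Check 1: correct format.
    if not isinstance(messages, list) or not messages:
        return False
    for turn in messages:
        if not isinstance(turn, dict):
            return False
        if not isinstance(turn.get("role"), str):
            return False
        if not isinstance(turn.get("content"), str):
            return False

    roles = [turn["role"] for turn in messages]

    # Checks 2 and 3 (first half): strict alternation that begins with
    # 'user', i.e. even positions are 'user', odd positions 'assistant'.
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(roles))]
    if roles != expected:
        return False

    # Check 3 (second half): the conversation must end with 'assistant'.
    return roles[-1] == "assistant"
```

Applied to a loaded split, a validator like this could be used as a row filter on the `messages` column; a conversation that ends on a user turn or contains two consecutive turns from the same role would be rejected.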

## Citation

```bibtex
@software{BSC-LT/m-personas,
  author = {Da Dalt, Severino and Pikabea, Iñigo and Prats, Jaume and Pamies, Marc},
  title = {mPersonas Dataset},
  month = {July},
  year = {2025},
  url = {https://huggingface.co/datasets/BSC-LT/m-personas}
}
```
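
The N-gram deduplication step described under Persona Filtering (MinHash with shingle size 1, 128 permutations, similarity threshold 0.7) can be sketched with the standard library alone. This is an illustrative approximation under several assumptions, not the pipeline's implementation: shingles of size 1 are taken to be lowercased whitespace-separated words, seeded BLAKE2b hashes stand in for the 128 permutations, signatures are compared pairwise instead of through LSH buckets, and an estimated similarity of 0.7 or more is treated as a duplicate. All function names are hypothetical.

```python
import hashlib

NUM_PERM = 128    # number of hash functions ("permutations"), from the text
THRESHOLD = 0.7   # similarity threshold, from the text


def shingles(text: str) -> set[str]:
    """Size-1 shingles; assumed here to be lowercased whitespace tokens."""
    return set(text.lower().split())


def _seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; the seed selects one of NUM_PERM functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text)
    if not tokens:
        # Sentinel signature for empty input so min() is never called on
        # an empty sequence.
        return [-1] * NUM_PERM
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(personas: list[str]) -> list[str]:
    """Keep a persona only if it is not a near-duplicate of one already kept.

    Pairwise comparison is quadratic; LSH exists to avoid exactly this
    cost at scale and is omitted here for brevity.
    """
    kept: list[str] = []
    signatures: list[list[int]] = []
    for text in personas:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < THRESHOLD for other in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```

Because the shingles are single tokens, two personas that use the same words in a different order yield identical signatures and are treated as duplicates, which is one reason a second, embedding-based pass is useful for finer-grained cases.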

Provider:
maas
Created:
2025-11-27