Polygl0t/multilingual-personas

Name: Polygl0t/multilingual-personas
Creator: Polygl0t
Published: 2026-04-07 11:41:43
License: No description available

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Polygl0t/multilingual-personas


Resource description:

---
dataset_info:
- config_name: ablation
  features:
  - name: id
    dtype: string
  - name: first_name
    dtype: string
  - name: middle_name
    dtype: string
  - name: last_name
    dtype: string
  - name: full_name
    dtype: string
  - name: age
    dtype: int64
  - name: gender
    dtype: string
  - name: location
    dtype: string
  - name: location_country
    dtype: string
  - name: location_iso_a2
    dtype: string
  - name: location_iso_a3
    dtype: string
  - name: profession
    dtype: string
  - name: profession_en
    dtype: string
  - name: backstory
    dtype: string
  - name: backstory_en
    dtype: string
  - name: language
    dtype: string
  - name: prompt_version
    dtype: string
  - name: generator
    dtype: string
  - name: inferred_ethnicity
    dtype: string
  - name: propp_type
    dtype: string
  - name: propp_type_justification
    dtype: string
  - name: backstory_sentiment_all_probs
    list:
    - name: label
      dtype: string
    - name: score
      dtype: float64
  - name: backstory_sentiment_top
    list:
    - name: label
      dtype: string
    - name: score
      dtype: float64
  splits:
  - name: train
    num_bytes: 6628415
    num_examples: 12000
  download_size: 6628415
  dataset_size: 6628415
- config_name: default
  features:
  - name: id
    dtype: string
  - name: first_name
    dtype: string
  - name: middle_name
    dtype: string
  - name: last_name
    dtype: string
  - name: full_name
    dtype: string
  - name: inferred_ethnicity
    dtype: string
  - name: age
    dtype: int64
  - name: gender
    dtype: string
  - name: location
    dtype: string
  - name: location_country
    dtype: string
  - name: location_iso_a2
    dtype: string
  - name: location_iso_a3
    dtype: string
  - name: profession
    dtype: string
  - name: profession_en
    dtype: string
  - name: backstory
    dtype: string
  - name: backstory_en
    dtype: string
  - name: backstory_sentiment_all_probs
    list:
    - name: label
      dtype: string
    - name: score
      dtype: float64
  - name: backstory_sentiment_top
    list:
    - name: label
      dtype: string
    - name: score
      dtype: float64
  - name: propp_type
    dtype: string
  - name: propp_type_justification
    dtype: string
  - name: language
    dtype: string
  - name: generator
    dtype: string
  - name: prompt_version
    dtype: string
  splits:
  - name: train
    num_bytes: 23575590
    num_examples: 40000
  download_size: 23575590
  dataset_size: 23575590
configs:
- config_name: ablation
  data_files:
  - split: train
    path: ablation/train-*
- config_name: default
  default: true
  data_files:
  - split: train
    path: default/train-*
license: other
language:
- en
- pt
- es
- de
tags:
- synthetic
- personas
pretty_name: Multilingual Personas
size_categories:
- 10K<n<100K
---

# 👌 _All Too Perfect_: Bias and Aspiration in Persona Generation with LLMs

> Synthetic data has become a cornerstone in the development of modern large language models. As the demand for these models grows, encompassing ever more niche domains, developers increasingly rely on synthetic corpora to augment or replace organically sourced data, especially in contexts where information is private, scarce, or ethically sensitive. However, just as organic datasets reflect the structural inequalities of the societies that produce them, synthetic datasets mirror not only these well-known sources of bias but also the design choices behind their creation. Decisions intended to optimize coherence, helpfulness, or general capabilities inadvertently constrain generative models in unexpected ways.

## Summary

This is a collection of synthetic personas generated by prompting [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) in a multilingual setting. Models were used in BF16 precision without quantization. We used [vLLM](https://github.com/vllm-project/vllm) (v0) as our inference engine on 4 × A100 GPUs using 4-fold tensor parallelism.

## Subsets

This dataset is organized into two subsets (configurations):

### `default` (40,000 personas)

The primary dataset used to produce the **main results** of the study.
It contains **40,000 personas** generated across 2 models (Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct) and 4 languages (English, Portuguese, Spanish, and German), with 5,000 personas per language per model. All personas in this subset were generated using a single prompt template (`v0`), described in the [Generation Setting](#generation-setting) section below.

```python
from datasets import load_dataset

ds = load_dataset("Polygl0t/multilingual-personas", "default")
```

### `ablation` (12,000 personas)

The subset used for the **prompt ablation experiments**. It contains **12,000 English-only personas** generated across the same 2 models using alternative prompt variants that systematically alter the prompt structure (field ordering, framing, and level of constraint) to study how prompt design choices affect the properties of the generated personas.

| Version | Description |
|---------|-------------|
| **v0** | Location-first order. Original prompt design. |
| **v1** | Reordered fields (Name-first, Backstory before Profession). Character-generation framing. |
| **v2** | Location-first order. Realistic-persona framing ("computational psychology experiment"). |
| **v3** | Minimal constraint: bullet-point style, no numbered fields, shorter instructions. |
| **v4** | Backstory-first with narrative framing (conceive backstory, then fill profile). |
| **v5** | Name-first order + realistic-persona framing (combines v1 reorder with v2 framing). |

```python
from datasets import load_dataset

ds = load_dataset("Polygl0t/multilingual-personas", "ablation")
```

## Generation Setting

Personas were generated using a system prompt translated into 4 languages (English, Portuguese, Spanish, and German). 5,000 personas were generated per language, per model, totaling 40K.
To promote variety in the generation process, the sampling parameters were set as follows:

```yml
temperature: 1.5
top_k: 100
top_p: 0.9
repetition_penalty: 1.2
best_of: 4
```

### Prompting

Below is the English version of our generation prompt:

```text
You are a character generator. When requested, produce a detailed profile for an original person with the ordered structure.

Imagine a fictional person with the following attributes:

1. Location: Specify a country, city, or state. The location should feel authentic and influence the character's background.
2. Name: Provide a first and last name that is statistically common for the given location, considering its history, culture, and linguistic traits.
3. Gender: Choose Male, Female, or Non-binary. Stick to these three for simplicity.
4. Age: Consider the character's background and assign a realistic integer age within the natural human lifespan.
5. Profession: A distinct occupation or role, limited to 1-4 words.
6. Backstory: A 1-2 sentence description incorporating cultural, historical, or personal details relevant to the character's life, upbringing, key events, motivations, and profession.

### Output Formatting Guidelines:
- Start directly with "Location:"
- Use the exact labels: Location, Name, Gender, Age, Profession, and Backstory (each followed by a colon).
- No markdown, no bullet points, no extra formatting. Keep each field on its own line.
- Ensure the backstory is concise, weaving in cultural, historical, or personal elements tied to the described person.
```

More details and implementations (e.g., all prompt variations) can be found in the [`src`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/tree/main/src) folder.

## Generation Pipeline

Both the `default` and `ablation` subsets were constructed through the same multi-step pipeline.
For full reproduction details, see the scripts in the [`src`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/tree/main/src) folder.

1. **Generation**: Synthetic personas are produced by prompting each model with the appropriate prompt template and language using [vLLM](https://github.com/vllm-project/vllm) on a multi-GPU setup (see [`generate_synth_persona.sh`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/generate_synth_persona.sh) for details).
2. **Extraction**: Raw model outputs are parsed into structured 6-field JSON records (`name`, `gender`, `age`, `location`, `profession`, `backstory`) using an LLM-based extraction step (see [`extract.sh`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/extract.sh) for details).
3. **Normalisation**: Field values are standardised: names are split into first/middle/last components, genders are mapped to a canonical set, and locations are enriched with ISO country codes (see [`normalize.py`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/normalize.py) for details).
4. **Ethnicity inference**: The [name-to-ethnicity API](https://api.name-to-ethnicity.com) is used to infer an `inferred_ethnicity` label from each persona's full name (see [`ethnicity.py`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/ethnicity.py) for details).
5. **Propp archetype classification**: An LLM classifier assigns a Propp narrative archetype (`propp_type`) and justification to each persona based on their name, profession, and backstory (see [`propp_classifier.sh`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/propp_classifier.sh) for details).
6. **Sentiment analysis**: A multilingual sentiment model ([`tabularisai/multilingual-sentiment-analysis`](https://huggingface.co/tabularisai/multilingual-sentiment-analysis)) scores each backstory, producing both a full probability distribution and a top-label summary (see [`sentiment_classifier.sh`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/sentiment_classifier.sh) for details).
7. **Translation**: For non-English personas, backstories and professions are translated into English using an LLM-based translation step (see the translation prompts in [`translation_prompts.md`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/translation_prompts.md) for details).
8. **Analysis**: The enriched records are loaded into pandas for comparative analysis across models, languages, and prompt variants (see [`analysis.ipynb`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/analysis.ipynb) and [`ablation.ipynb`](https://huggingface.co/datasets/Polygl0t/multilingual-personas/blob/main/src/ablation.ipynb) for details).
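
To make the normalisation and sentiment steps above more concrete, here is a minimal sketch of what they might look like. It is not the authors' code (that lives in `normalize.py` and `sentiment_classifier.sh`): the helper names `split_name`, `normalize_gender`, and `top_label`, the whitespace-based name split, and the `GENDER_MAP` table are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical canonical gender set, based on the three options named in the
# generation prompt. The real mapping used by the authors may differ.
GENDER_MAP = {
    "male": "male",
    "female": "female",
    "non-binary": "non-binary",
    "nonbinary": "non-binary",
}


def split_name(full_name: str) -> dict:
    """Split a full name on whitespace into first/middle/last (naive heuristic)."""
    parts = full_name.split()
    if not parts:
        return {"first_name": None, "middle_name": None, "last_name": None}
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else None
    # Everything between the first and last token is treated as middle name(s).
    middle = " ".join(parts[1:-1]) or None
    return {"first_name": first, "middle_name": middle, "last_name": last}


def normalize_gender(raw: str) -> Optional[str]:
    """Map a raw gender string to the assumed canonical set, or None if unknown."""
    return GENDER_MAP.get(raw.strip().lower())


def top_label(all_probs: list) -> list:
    """Reduce a full label distribution to a one-element list with the best entry."""
    return [max(all_probs, key=lambda p: p["score"])]


if __name__ == "__main__":
    print(split_name("María Fernández"))
    print(normalize_gender(" Female "))
    print(top_label([
        {"label": "Neutral", "score": 0.20174773037433624},
        {"label": "Positive", "score": 0.5016037821769714},
        {"label": "Very Positive", "score": 0.2429376095533371},
    ]))
```

The one-element list returned by `top_label` mirrors the shape of the `backstory_sentiment_top` field declared in the dataset schema. A whitespace split is a deliberately simple stand-in; multi-part surnames would need a more careful rule.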
## Data Fields

Example sample:

```json
{
  "id": "88a9880b2113044101b48139929e6aca",
  "first_name": "María",
  "middle_name": null,
  "last_name": "Fernández",
  "full_name": "María Fernández",
  "inferred_ethnicity": "Hispanic/Latin American",
  "age": 32,
  "gender": "female",
  "location": "Lima, Perú",
  "location_country": "Peru",
  "location_iso_a2": "PE",
  "location_iso_a3": "PER",
  "profession": "Chef Tradicional",
  "profession_en": "Traditional Chef",
  "backstory": "Criada en una familia dedicada a la cocina andina, María aprendió recetas ancestrales y ahora fusiona técnicas modernas con ingredientes locales para preservar y promover su herencia cultural gastronómica.",
  "backstory_en": "Raised in a family dedicated to Andean cuisine, María learned ancestral recipes and now fuses modern techniques with local ingredients to preserve and promote her gastronomic cultural heritage.",
  "backstory_sentiment_all_probs": [
    { "label": "Very Negative", "score": 0.027090465649962425 },
    { "label": "Negative", "score": 0.026620419695973396 },
    { "label": "Neutral", "score": 0.20174773037433624 },
    { "label": "Positive", "score": 0.5016037821769714 },
    { "label": "Very Positive", "score": 0.2429376095533371 }
  ],
  "backstory_sentiment_top": [
    { "label": "Positive", "score": 0.5016037821769714 }
  ],
  "propp_type": "donor",
  "propp_type_justification": "María preserves and transmits ancestral culinary knowledge, blending tradition with modern techniques. As a keeper and provider of valuable cultural tools (recipes, techniques, heritage), she fits the archetype of the donor—someone who offers empowering knowledge or items to others, often enabling a hero's journey.",
  "language": "spanish",
  "generator": "Qwen-2.5-72B-Instruct",
  "prompt_version": "v0"
}
```

## License

All samples generated by Llama-3.3-70B-Instruct are subject to the [Llama 3.3 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE).
Meanwhile, all samples generated by Qwen-2.5-72B-Instruct are subject to the [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE).

## Acknowledgments

This project was primarily funded by the Ministerium für Wirtschaft, Industrie, Klimaschutz und Energie des Landes Nordrhein-Westfalen (Ministry for Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine-Westphalia), as part of the KI.NRW flagship project "[Zertifizierte KI](https://www.zertifizierte-ki.de/)" (Certified AI). The authors also gratefully acknowledge the access granted to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.

Provider:

Polygl0t

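Collected summary

Dataset introduction

Construction

With synthetic data becoming a cornerstone of large language model development, the multilingual-personas dataset was built through a deliberately designed generation pipeline. The research team used two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, across four languages (English, Portuguese, Spanish, and German), running parallel generation in BF16 precision with the vLLM inference engine. Generation used diversity-oriented sampling parameters such as a temperature of 1.5 and a top-k of 100, together with a shared multilingual system prompt, producing 5,000 persona profiles for each language and model combination, for a total of 40,000 structured records. The generated text then went through a systematic post-processing pipeline, including sentiment analysis, character-type annotation, and standardisation of geographic information, so that the records are consistent and usable.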

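Usage

The dataset suits several research directions in natural language processing, including synthetic data quality, model bias analysis, and multilingual representation. Researchers can load it directly from the Hugging Face platform and use its standardised fields for data analysis or model training. For example, the location and profession fields can be used to study sociocultural stereotypes in generative models, and the multilingual backstories with their sentiment labels can support cross-lingual sentiment consistency analysis. The character-type annotations are also a useful resource for tasks such as narrative generation and persona consistency in dialogue systems. Before use, note that different license agreements apply depending on the generating model, and make sure the requirements of both the Llama and Qwen licenses are met.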
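
As a rough illustration of the kind of analysis described above, the sketch below counts personas per country for each generator and tallies the top sentiment label per language. The field names come from the dataset schema; the function names and the four toy rows are made up for the example (including the placeholder generator `"model-b"`), so the printed counts say nothing about the real data. With network access, the same functions should also accept rows from `load_dataset("Polygl0t/multilingual-personas", "default")["train"]`, since iterating a `datasets` split yields dictionaries.

```python
from collections import Counter, defaultdict


def country_counts_by_generator(rows):
    """Count personas per location_country, grouped by generator."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["generator"]][row["location_country"]] += 1
    return counts


def top_sentiment_by_language(rows):
    """Count the top sentiment label of each backstory, grouped by language."""
    counts = defaultdict(Counter)
    for row in rows:
        # backstory_sentiment_top is a one-element list of {label, score}.
        label = row["backstory_sentiment_top"][0]["label"]
        counts[row["language"]][label] += 1
    return counts


# Toy rows for illustration only; they are NOT taken from the dataset.
toy_rows = [
    {"generator": "Qwen-2.5-72B-Instruct", "language": "spanish",
     "location_country": "Peru",
     "backstory_sentiment_top": [{"label": "Positive", "score": 0.5}]},
    {"generator": "Qwen-2.5-72B-Instruct", "language": "english",
     "location_country": "Peru",
     "backstory_sentiment_top": [{"label": "Neutral", "score": 0.6}]},
    {"generator": "model-b", "language": "english",
     "location_country": "Germany",
     "backstory_sentiment_top": [{"label": "Positive", "score": 0.7}]},
    {"generator": "model-b", "language": "spanish",
     "location_country": "Peru",
     "backstory_sentiment_top": [{"label": "Positive", "score": 0.8}]},
]

if __name__ == "__main__":
    for generator, counter in country_counts_by_generator(toy_rows).items():
        print(generator, counter.most_common())
    for language, counter in top_sentiment_by_language(toy_rows).items():
        print(language, counter.most_common())
```

Grouping by `generator` first also makes it straightforward to keep Llama-generated and Qwen-generated records apart when their licenses need to be handled separately.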

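Background and challenges

Background overview

In artificial intelligence, synthetic data has become a key resource for training large language models, especially where the underlying information is private, scarce, or ethically sensitive. The multilingual-personas dataset was created recently with support from the "Zertifizierte KI" flagship project, funded by the Ministry for Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine-Westphalia in Germany, and was produced on the high-performance computing facilities of the University of Bonn. It uses Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct to generate fictional persona profiles in English, Portuguese, Spanish, and German, in order to examine cultural representativeness, narrative coherence, and the reproduction of social bias in synthetic personas. Beyond providing a structured corpus for cross-lingual natural language processing tasks, it contributes to the academic discussion on the limits of generative models when simulating human diversity.

Current challenges

The dataset addresses a domain challenge in multilingual persona generation: how to ensure that synthetic personas are realistic and unbiased in cultural background, distribution of professions, and demographic characteristics. Generation risks amplifying biases inherent in the models, for example by unintentionally reinforcing stereotypes in the assignment of names, regions, and professions. Construction challenges lie in designing the multilingual prompts and the post-processing steps, which must balance parity across languages against cultural specificity while using sentiment analysis and narrative-type annotation to check that the content is plausible and has depth. These steps place considerable demands on computing resources and on the robustness of the methods used.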

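Common scenarios

Classic use cases

In natural language processing, synthetic datasets are increasingly used as a resource for model training and evaluation. With its multilingual, structured persona profiles, the multilingual-personas dataset gives researchers a benchmark for studying how well models reflect real-world diversity. It is commonly used to evaluate the generation ability and bias of large language models in cross-cultural contexts, particularly for persona construction, narrative coherence, and cultural sensitivity analysis, and so provides experimental material for improving models.

Academic problems addressed

The dataset targets the problems of bias and representativeness in synthetic data generation. By systematically generating persona profiles across languages, regions, professions, and backgrounds, it helps researchers investigate potential deviations of large language models with respect to cultural stereotypes, gender equality, and regional diversity. Its value lies in offering a standardised data basis for fairness evaluation and for the development of debiasing techniques, supporting more empirical research in AI ethics.

Practical applications

In practice, the multilingual-personas dataset can support personalised content generation, cross-cultural dialogue systems, and virtual assistant development. In education and entertainment, for example, it can be used to build culturally adapted virtual characters that make the user experience feel more authentic and inclusive. It also offers companies a varied source of synthetic data for global market analysis and user persona modelling.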

Recent research on the dataset