
Mostafa190/TwinnyAI-Personas-Dataset

Source: Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Mostafa190/TwinnyAI-Personas-Dataset
Description:
---
license: mit
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- personas
- synthetic
- behavioral-conditioning
- psychology
- professional-profiles
- role-playing
- fine-tuning
- roleplay
- instruction-tuning
- digital-twins
- ai-agents
pretty_name: TWINNY.AI Personas Dataset
size_categories:
- 10K<n<100K
multilinguality:
- monolingual
source_datasets:
- original
---

<div align="center">
<img src="banner.png" alt="TWINNY.AI Banner" width="100%">
</div>

## Overview

The **TWINNY.AI Personas Dataset** is a synthetic collection of **400 richly structured professional personas**, engineered to power behavioral AI twins, persona-driven language model fine-tuning, and professional simulation systems.

Each persona is built from **14 attributes** spanning demographics, professional context, behavioral psychology, and communication style, sampled with **realistic non-uniform distributions** that mirror actual workforce demographics rather than uniform random sampling.

This dataset is the core training resource for the **TWINNY.AI** project: a system designed to create AI behavioral twins that authentically reflect human professional archetypes across industries, seniority levels, and cultural contexts.

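
The non-uniform sampling described above can be illustrated with a short standard-library sketch. The weights are assumptions, rounded from the role-level shares reported later in this card, and `sample_role_levels` is a hypothetical helper rather than the project's pipeline code.

```python
import random
from collections import Counter

# Illustrative weights only (assumption): rounded from the role-level shares
# reported in this card, not taken from the pipeline's source code.
ROLE_WEIGHTS = {
    "Junior": 0.20,
    "Mid-level": 0.30,
    "Senior": 0.25,
    "Manager": 0.14,
    "Director": 0.07,
    "Executive": 0.04,
}


def sample_role_levels(n, seed=42):
    """Draw n role levels with non-uniform probabilities (hypothetical helper)."""
    rng = random.Random(seed)  # a local RNG keeps the draw reproducible
    roles = list(ROLE_WEIGHTS)
    weights = list(ROLE_WEIGHTS.values())
    return rng.choices(roles, weights=weights, k=n)


roles = sample_role_levels(400)
counts = Counter(roles)
for role, count in counts.most_common():
    print(f"{role:<10} {count:>4} {count / len(roles):6.1%}")
```

Using a seeded `random.Random` instance instead of the module-level functions keeps the draw reproducible without touching global random state, which is one plausible way to get the fixed-seed behavior the card describes.
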

---

## Dataset at a Glance

| Property | Value |
|----------|-------|
| Total personas | **400** |
| Attributes per persona | **14** |
| Industries covered | **13** |
| Cultural contexts | **7** |
| Role levels | **6** |
| Age ranges | **6** |
| Behavioral scores | **7** (scale 1–5) |
| Generation seed | `42` (fully reproducible) |
| License | MIT |
| Formats | CSV, JSONL, JSON |

---

## Dataset Structure

```
TwinnyAI-Personas-Dataset/
├── TwinnyAI_final_dataset.csv   ← Complete validated dataset (primary file)
├── TwinnyAI_train.jsonl         ← Training split, JSONL format (LLM-ready)
├── personas.csv                 ← Raw persona profiles
└── personas.json                ← Full JSON objects with nested structure
```

---

## Schema: Field Reference

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `persona_id` | int | Unique persona identifier | `1` |
| `age_range` | string | Age bracket | `"29–35"` |
| `industry` | string | Professional industry | `"Technology"` |
| `role_level` | string | Seniority level | `"Senior"` |
| `cultural_context` | string | Cultural background | `"North American"` |
| `communication_style` | string | Primary communication mode | `"detailed and analytical"` |
| `risk_tolerance_score` | int (1–5) | Appetite for uncertainty and bold decisions | `4` |
| `approval_threshold_score` | int (1–5) | Need for external validation before acting | `2` |
| `formality_score` | int (1–5) | Preference for formal vs. informal interaction | `4` |
| `conciseness_score` | int (1–5) | Tendency toward brief vs. elaborate responses | `3` |
| `confidence_score` | int (1–5) | Self-assurance in decisions and communication | `4` |
| `decision_speed_score` | int (1–5) | Speed of reaching conclusions | `3` |
| `ai_trust_score` | int (1–5) | Openness to AI-assisted decision-making | `2` |
| `delegation_preference` | bool | Comfort with delegating tasks to others | `FALSE` |
| `created_at` | ISO 8601 | Generation timestamp | `"2026-02-27T13:06:40"` |

---

## Sample Record

```json
{
  "persona_id": 1,
  "age_range": "29–35",
  "industry": "Technology",
  "role_level": "Senior",
  "cultural_context": "North American",
  "communication_style": "detailed and analytical",
  "risk_tolerance_score": 4,
  "approval_threshold_score": 2,
  "formality_score": 4,
  "conciseness_score": 3,
  "confidence_score": 4,
  "decision_speed_score": 3,
  "ai_trust_score": 2,
  "delegation_preference": false,
  "created_at": "2026-02-27T13:06:40.273465"
}
```

---

## Distribution Statistics

> All statistics below are computed directly from the 400-persona dataset.

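
Counts and percentages of the kind reported in this section can be recomputed from any of the data files. The sketch below uses only the standard library and runs on a tiny inline sample so that it is self-contained; the `distribution` helper and the four sample records are hypothetical and are not part of the dataset.

```python
from collections import Counter


def distribution(records, field):
    """Return (value, count, percent) rows for one categorical field, largest first."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return [(value, n, 100 * n / total) for value, n in counts.most_common()]


# Made-up records for illustration; in practice, load TwinnyAI_train.jsonl
# or personas.json and pass the parsed records instead.
sample = [
    {"industry": "Technology", "role_level": "Senior"},
    {"industry": "Technology", "role_level": "Junior"},
    {"industry": "Finance", "role_level": "Senior"},
    {"industry": "Healthcare", "role_level": "Manager"},
]

rows = distribution(sample, "industry")
for value, n, pct in rows:
    print(f"{value:<12} {n:>3} {pct:5.1f}%")
```

The same helper works for `role_level`, `age_range`, `cultural_context`, and `communication_style`, since each is a plain string field in the schema.
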
### 🏭 Industry Distribution

| Industry | Count | % | |
|----------|------:|--:|-|
| Technology | 72 | 18.0% | `████████████████████` |
| Finance | 57 | 14.2% | `████████████████` |
| Healthcare | 40 | 10.0% | `███████████` |
| Consulting | 37 | 9.2% | `██████████` |
| Legal | 28 | 7.0% | `████████` |
| Operations | 28 | 7.0% | `████████` |
| Marketing | 27 | 6.8% | `███████` |
| Education | 24 | 6.0% | `███████` |
| Media | 22 | 5.5% | `██████` |
| Government | 19 | 4.8% | `█████` |
| Retail | 17 | 4.2% | `█████` |
| Manufacturing | 16 | 4.0% | `████` |
| Real Estate | 13 | 3.2% | `████` |

### 🎯 Role Level Distribution

| Role Level | Count | % | |
|------------|------:|--:|-|
| Mid-level | 119 | 29.8% | `████████████████████████████████` |
| Senior | 101 | 25.2% | `████████████████████████████` |
| Junior | 80 | 20.0% | `██████████████████████` |
| Manager | 55 | 13.8% | `███████████████` |
| Director | 28 | 7.0% | `████████` |
| Executive | 17 | 4.2% | `█████` |

### 🎂 Age Range Distribution

| Age Range | Count | % | |
|-----------|------:|--:|-|
| 29–35 | 101 | 25.2% | `████████████████████████████` |
| 36–42 | 89 | 22.2% | `█████████████████████████` |
| 43–50 | 81 | 20.2% | `██████████████████████` |
| 22–28 | 59 | 14.8% | `████████████████` |
| 51–58 | 46 | 11.5% | `█████████████` |
| 59–65 | 24 | 6.0% | `███████` |

### 🌍 Cultural Context Distribution

| Cultural Context | Count | % | |
|-----------------|------:|--:|-|
| North American | 121 | 30.2% | `████████████████████████████████` |
| Western European | 80 | 20.0% | `██████████████████████` |
| East Asian | 60 | 15.0% | `████████████████` |
| South Asian | 47 | 11.8% | `█████████████` |
| Middle Eastern | 32 | 8.0% | `█████████` |
| Latin American | 32 | 8.0% | `█████████` |
| African | 28 | 7.0% | `████████` |

### 💬 Communication Style Distribution

| Style | Count | % |
|-------|------:|--:|
| empathetic and consultative | 50 | 12.5% |
| detailed and analytical | 46 | 11.5% |
| data-driven | 46 | 11.5% |
| diplomatic and cautious | 45 | 11.2% |
| formal and structured | 44 | 11.0% |
| direct and concise | 42 | 10.5% |
| warm and collaborative | 42 | 10.5% |
| technical and precise | 38 | 9.5% |
| narrative-focused | 30 | 7.5% |
| assertive and decisive | 30 | 7.5% |

### 📊 Behavioral Score Summary (scale 1–5)

| Score Field | Mean | Std | Min | Max | Interpretation |
|-------------|-----:|----:|----:|----:|----------------|
| `risk_tolerance_score` | **3.24** | ~1.1 | 1 | 5 | Slight risk appetite |
| `approval_threshold_score` | **3.18** | ~1.1 | 1 | 5 | Moderate validation need |
| `ai_trust_score` | **3.14** | ~1.1 | 1 | 5 | Mild AI openness |
| `formality_score` | **3.05** | ~1.2 | 1 | 5 | Balanced formality |
| `conciseness_score` | **3.02** | ~1.0 | 1 | 5 | Balanced length |
| `decision_speed_score` | **2.99** | ~1.0 | 1 | 5 | Moderate deliberation |
| `confidence_score` | **2.91** | ~1.1 | 1 | 5 | Slightly below-neutral (realistic) |

> Scores follow Gaussian distributions centered at 3.0 with σ ≈ 1.0–1.2, with role-level biases applied via the pipeline's `ROLE_RISK_BIAS` and `ROLE_APPROVAL_BIAS` lookup tables.

### 🤝 Delegation Preference

| Preference | Count | % |
|------------|------:|--:|
| Comfortable delegating (`TRUE`) | ~218 | **54.5%** |
| Prefers direct control (`FALSE`) | ~182 | **45.5%** |

> Generated via `random.random() > 0.45`, so approximately 55% of personas prefer delegation.

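
The score and delegation sampling described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the pipeline's code: `EXAMPLE_RISK_BIAS` holds placeholder offsets (the contents of the real `ROLE_RISK_BIAS` table are not shown here), and rounding to the nearest integer followed by clamping to 1–5 is an assumed way of turning a Gaussian draw into a valid score.

```python
import random

# Placeholder offsets (assumption); the real ROLE_RISK_BIAS table is not shown in this card.
EXAMPLE_RISK_BIAS = {"Junior": -1, "Mid-level": 0, "Executive": 1}


def sample_score(rng, mean=3.0, sigma=1.1, bias=0):
    """Draw a Gaussian score, shift it by a role bias, round, and clamp to 1-5."""
    raw = rng.gauss(mean, sigma) + bias
    return max(1, min(5, round(raw)))


def sample_delegation(rng):
    """Return True with probability of about 0.55, mirroring `random.random() > 0.45`."""
    return rng.random() > 0.45


rng = random.Random(42)
risk_scores = [sample_score(rng, bias=EXAMPLE_RISK_BIAS["Executive"]) for _ in range(1000)]
delegation = [sample_delegation(rng) for _ in range(1000)]

print(f"mean risk score: {sum(risk_scores) / len(risk_scores):.2f}")
print(f"delegation share: {sum(delegation) / len(delegation):.1%}")
```

Clamping pulls the tails back into range, so a positive bias shifts the mean upward by less than the raw offset.
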

---

## Behavioral Trait Correlations

The dataset encodes **realistic role-level behavioral biases** rather than sampling every role identically:

| Role | Risk Tolerance Bias | Approval Need Bias | Confidence Bias |
|------|:-------------------:|:------------------:|:---------------:|
| Junior | −1 to 0 (cautious) | +1 to +2 (high) | −1 to 0 |
| Mid-level | 0 (neutral) | 0 to +1 | −1 to 0 |
| Senior | 0 to +1 (bold) | −1 to 0 (low) | 0 to +1 |
| Manager | −1 to +1 (spread) | −1 to +1 | 0 to +1 |
| Director | 0 to +1 | −2 to 0 | 0 to +1 |
| Executive | +1 to +2 (high) | −2 to −1 (very low) | 0 to +1 |

This means an **Executive** persona tends to show high risk tolerance and low approval seeking, while a **Junior** persona trends toward caution and external validation, a pattern intended to mirror real-world behavioral research.

---

## Usage Examples

### Load with 🤗 Datasets

```python
from datasets import load_dataset

ds = load_dataset("Mostafa190/TwinnyAI-Personas-Dataset")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['persona_id', 'age_range', 'industry', ...],
#         num_rows: 400
#     })
# })

print(ds["train"][0])
```

### Load and explore with pandas

```python
import pandas as pd

# Reading hf:// paths requires the huggingface_hub package to be installed.
df = pd.read_csv(
    "hf://datasets/Mostafa190/TwinnyAI-Personas-Dataset/TwinnyAI_final_dataset.csv"
)

print(df.shape)       # (400, 15)
print(df.describe())  # Score statistics

# Filter: Senior Technology personas who trust AI
tech_seniors = df[
    (df["industry"] == "Technology")
    & (df["role_level"] == "Senior")
    & (df["ai_trust_score"] >= 4)
]
print(f"Matched: {len(tech_seniors)} personas")
```

### Load JSONL for fine-tuning

```python
import json

# One JSON object per line; skip any blank lines.
with open("TwinnyAI_train.jsonl", "r", encoding="utf-8") as f:
    personas = [json.loads(line) for line in f if line.strip()]

p = personas[0]
print(f"#{p['persona_id']}: {p['role_level']} in {p['industry']}")
print(f"Style: {p['communication_style']}")
print(f"Risk: {p['risk_tolerance_score']}/5 | AI Trust: {p['ai_trust_score']}/5")
```

### Generate LLM system prompts

```python
import json
import random

with open("personas.json", encoding="utf-8") as f:
    personas = json.load(f)


def to_system_prompt(p):
    """Render one persona record as a system prompt string."""
    return (
        f"You are a {p['age_range']}-year-old {p['role_level']} "
        f"in the {p['industry']} industry ({p['cultural_context']} background). "
        f"Your communication style is {p['communication_style']}. "
        f"Risk tolerance: {p['risk_tolerance_score']}/5. "
        f"Confidence: {p['confidence_score']}/5. "
        f"You {'delegate when possible' if p['delegation_preference'] else 'prefer direct control'}. "
        f"Respond consistently in character."
    )


persona = random.choice(personas)
print(to_system_prompt(persona))
```

---

## Generation Pipeline

| Script | Role |
|--------|------|
| `01_generate_personas.py` | Samples traits with weighted distributions, applies role-level biases, generates 400 personas with `seed=42` |
| `03_validate_dataset.py` | Schema validation, duplicate removal, score range checks |
| `04_export_jsonl.py` | Exports to JSONL training format |
| `groq_generator.py` | Optional: enriches personas with LLM-generated narrative descriptions via Groq API |

**Fully reproducible:** Running `generate_personas(n=400, seed=42)` produces the exact same 400 personas every time.

---

## Intended Use

- **LLM fine-tuning**: Train models to generate or simulate professional personas
- **Behavioral AI twins**: Power digital twin systems with realistic archetypes
- **AI agent seeding**: Consistent behavioral profiles for role-play agents
- **Prompt engineering**: System prompt templates for persona-conditioned generation
- **UX research & simulation**: Diverse synthetic user profiles for product testing
- **Academic research**: Study trait distributions in synthetic professional populations

## Out-of-Scope Use

- Impersonating or deceiving real individuals
- Building surveillance or profiling systems targeting real people
- Generating discriminatory content based on demographic attributes

---

## Ethics & Limitations

- **Bias transparency**: Cultural distributions reflect configurable pipeline priors (e.g., 30% North American). These are adjustable weights, not value judgments.
- **Score design**: Gaussian-sampled scores with σ ≈ 1.0–1.2; extremes (1 or 5) exist but are intentionally less frequent.
- **No PII**: Zero personally identifiable information in any field.

---

## License

Released under the **[MIT License](https://opensource.org/licenses/MIT)**. Free to use, modify, and distribute with attribution.

---

## Citation

```bibtex
@dataset{twinnyai_personas_2026,
  author       = {Mostafa190},
  title        = {TWINNY.AI Personas Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Mostafa190/TwinnyAI-Personas-Dataset}},
  license      = {MIT},
  note         = {400 synthetic professional personas, Twinnify pipeline, seed=42}
}
```

---

<div align="center">
<sub>Built with ❤️ as part of the TWINNY.AI project · Powered by the 20AI Pipeline</sub>
</div>

Provider:
Mostafa190