Mostafa190/TwinnyAI-Personas-Dataset
Source: Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Mostafa190/TwinnyAI-Personas-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- personas
- synthetic
- behavioral-conditioning
- psychology
- professional-profiles
- role-playing
- fine-tuning
- roleplay
- instruction-tuning
- digital-twins
- ai-agents
pretty_name: TWINNY.AI Personas Dataset
size_categories:
- 10K<n<100K
multilinguality:
- monolingual
source_datasets:
- original
---
<div align="center">
<img src="banner.png" alt="TWINNY.AI Banner" width="100%">
</div>
## Overview
The **TWINNY.AI Personas Dataset** is a synthetic collection of **400 richly structured professional personas**, engineered to power behavioral AI twins, persona-driven language model fine-tuning, and professional simulation systems.
Each persona is built from **14 attributes** spanning demographics, professional context, behavioral psychology, and communication style. Attributes are sampled from **realistic non-uniform distributions** intended to mirror actual workforce demographics rather than uniform random sampling.
This dataset is the core training resource for the **TWINNY.AI** project: a system designed to create AI behavioral twins that authentically reflect human professional archetypes across industries, seniority levels, and cultural contexts.
---
## Dataset at a Glance
| Property | Value |
|----------|-------|
| Total personas | **400** |
| Attributes per persona | **14** |
| Industries covered | **13** |
| Cultural contexts | **7** |
| Role levels | **6** |
| Age ranges | **6** |
| Behavioral scores | **7** (scale 1–5) |
| Generation seed | `42` (fully reproducible) |
| License | MIT |
| Formats | CSV, JSONL, JSON |
---
## Dataset Structure
```
TwinnyAI-Personas-Dataset/
├── TwinnyAI_final_dataset.csv   ← Complete validated dataset (primary file)
├── TwinnyAI_train.jsonl         ← Training split, JSONL format (LLM-ready)
├── personas.csv                 ← Raw persona profiles
└── personas.json                ← Full JSON objects with nested structure
```
---
## Schema — Field Reference
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `persona_id` | int | Unique persona identifier | `1` |
| `age_range` | string | Age bracket | `"29–35"` |
| `industry` | string | Professional industry | `"Technology"` |
| `role_level` | string | Seniority level | `"Senior"` |
| `cultural_context` | string | Cultural background | `"North American"` |
| `communication_style` | string | Primary communication mode | `"detailed and analytical"` |
| `risk_tolerance_score` | int (1–5) | Appetite for uncertainty and bold decisions | `4` |
| `approval_threshold_score` | int (1–5) | Need for external validation before acting | `2` |
| `formality_score` | int (1–5) | Preference for formal vs. informal interaction | `4` |
| `conciseness_score` | int (1–5) | Tendency toward brief vs. elaborate responses | `3` |
| `confidence_score` | int (1–5) | Self-assurance in decisions and communication | `4` |
| `decision_speed_score` | int (1–5) | Speed of reaching conclusions | `3` |
| `ai_trust_score` | int (1–5) | Openness to AI-assisted decision-making | `2` |
| `delegation_preference` | bool | Comfort with delegating tasks to others | `FALSE` |
| `created_at` | ISO 8601 | Generation timestamp | `"2026-02-27T13:06:40"` |
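
The field reference above can be turned into a simple record check. The following is a minimal sketch, assuming records have been decoded from JSON (so `delegation_preference` is a real boolean rather than the CSV string `FALSE`). The helper name `validate_persona` and its error messages are hypothetical and are not taken from the project's `03_validate_dataset.py`.

```python
from datetime import datetime

# Field names are taken from the schema table; the checks themselves are illustrative.
STRING_FIELDS = ["age_range", "industry", "role_level",
                 "cultural_context", "communication_style"]
SCORE_FIELDS = ["risk_tolerance_score", "approval_threshold_score",
                "formality_score", "conciseness_score", "confidence_score",
                "decision_speed_score", "ai_trust_score"]


def validate_persona(record: dict) -> list[str]:
    """Return a list of problems found in one persona record (empty if none)."""
    errors = []
    pid = record.get("persona_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(pid, bool) or not isinstance(pid, int):
        errors.append("persona_id must be an int")
    for field in STRING_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"{field} must be a non-empty string")
    for field in SCORE_FIELDS:
        value = record.get(field)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            errors.append(f"{field} must be an int in 1..5")
    if not isinstance(record.get("delegation_preference"), bool):
        errors.append("delegation_preference must be a bool")
    try:
        datetime.fromisoformat(str(record.get("created_at")))
    except ValueError:
        errors.append("created_at must be an ISO 8601 timestamp")
    return errors


sample = {
    "persona_id": 1,
    "age_range": "29–35",
    "industry": "Technology",
    "role_level": "Senior",
    "cultural_context": "North American",
    "communication_style": "detailed and analytical",
    "risk_tolerance_score": 4,
    "approval_threshold_score": 2,
    "formality_score": 4,
    "conciseness_score": 3,
    "confidence_score": 4,
    "decision_speed_score": 3,
    "ai_trust_score": 2,
    "delegation_preference": False,
    "created_at": "2026-02-27T13:06:40.273465",
}
print(validate_persona(sample))  # → []
```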
---
## Sample Record
```json
{
  "persona_id": 1,
  "age_range": "29–35",
  "industry": "Technology",
  "role_level": "Senior",
  "cultural_context": "North American",
  "communication_style": "detailed and analytical",
  "risk_tolerance_score": 4,
  "approval_threshold_score": 2,
  "formality_score": 4,
  "conciseness_score": 3,
  "confidence_score": 4,
  "decision_speed_score": 3,
  "ai_trust_score": 2,
  "delegation_preference": false,
  "created_at": "2026-02-27T13:06:40.273465"
}
```
---
## Distribution Statistics
> All statistics below are computed directly from the 400-persona dataset.
### 🏭 Industry Distribution
| Industry | Count | % | |
|----------|------:|--:|-|
| Technology | 72 | 18.0% | `████████████████████` |
| Finance | 57 | 14.2% | `████████████████` |
| Healthcare | 40 | 10.0% | `███████████` |
| Consulting | 37 | 9.2% | `██████████` |
| Legal | 28 | 7.0% | `████████` |
| Operations | 28 | 7.0% | `████████` |
| Marketing | 27 | 6.8% | `███████` |
| Education | 24 | 6.0% | `███████` |
| Media | 22 | 5.5% | `██████` |
| Government | 19 | 4.8% | `█████` |
| Retail | 17 | 4.2% | `█████` |
| Manufacturing | 16 | 4.0% | `████` |
| Real Estate | 13 | 3.2% | `████` |
### 🎯 Role Level Distribution
| Role Level | Count | % | |
|------------|------:|--:|-|
| Mid-level | 119 | 29.8% | `████████████████████████████████` |
| Senior | 101 | 25.2% | `████████████████████████████` |
| Junior | 80 | 20.0% | `██████████████████████` |
| Manager | 55 | 13.8% | `███████████████` |
| Director | 28 | 7.0% | `████████` |
| Executive | 17 | 4.2% | `█████` |
### 🎂 Age Range Distribution
| Age Range | Count | % | |
|-----------|------:|--:|-|
| 29–35 | 101 | 25.2% | `████████████████████████████` |
| 36–42 | 89 | 22.2% | `█████████████████████████` |
| 43–50 | 81 | 20.2% | `██████████████████████` |
| 22–28 | 59 | 14.8% | `████████████████` |
| 51–58 | 46 | 11.5% | `█████████████` |
| 59–65 | 24 | 6.0% | `███████` |
### 🌍 Cultural Context Distribution
| Cultural Context | Count | % | |
|-----------------|------:|--:|-|
| North American | 121 | 30.2% | `████████████████████████████████` |
| Western European | 80 | 20.0% | `██████████████████████` |
| East Asian | 60 | 15.0% | `████████████████` |
| South Asian | 47 | 11.8% | `█████████████` |
| Middle Eastern | 32 | 8.0% | `█████████` |
| Latin American | 32 | 8.0% | `█████████` |
| African | 28 | 7.0% | `████████` |
### 💬 Communication Style Distribution
| Style | Count | % |
|-------|------:|--:|
| empathetic and consultative | 50 | 12.5% |
| detailed and analytical | 46 | 11.5% |
| data-driven | 46 | 11.5% |
| diplomatic and cautious | 45 | 11.2% |
| formal and structured | 44 | 11.0% |
| direct and concise | 42 | 10.5% |
| warm and collaborative | 42 | 10.5% |
| technical and precise | 38 | 9.5% |
| narrative-focused | 30 | 7.5% |
| assertive and decisive | 30 | 7.5% |
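
The categorical tables above can be recomputed from the records themselves. Below is a minimal sketch using only the standard library; the helper name `distribution` is hypothetical, and the four-row `toy` list is made-up data that only shows the output shape.

```python
from collections import Counter


def distribution(records, field):
    """Count values of `field` and return (value, count, percent) rows, largest first."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return [(value, n, round(100 * n / total, 1)) for value, n in counts.most_common()]


# Tiny made-up sample, only to show the output shape.
toy = [
    {"industry": "Technology"}, {"industry": "Technology"},
    {"industry": "Technology"}, {"industry": "Finance"},
]
for value, n, pct in distribution(toy, "industry"):
    print(f"{value:<12} {n:>3} {pct:>5.1f}%")
```

Summing the count column of any such table is a quick sanity check that it covers all 400 personas.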
### 📊 Behavioral Score Summary (scale 1–5)
| Score Field | Mean | Std | Min | Max | Interpretation |
|-------------|-----:|----:|----:|----:|----------------|
| `risk_tolerance_score` | **3.24** | ~1.1 | 1 | 5 | Slight risk appetite |
| `approval_threshold_score` | **3.18** | ~1.1 | 1 | 5 | Moderate validation need |
| `ai_trust_score` | **3.14** | ~1.1 | 1 | 5 | Mild AI openness |
| `formality_score` | **3.05** | ~1.2 | 1 | 5 | Balanced formality |
| `conciseness_score` | **3.02** | ~1.0 | 1 | 5 | Balanced length |
| `decision_speed_score` | **2.99** | ~1.0 | 1 | 5 | Moderate deliberation |
| `confidence_score` | **2.91** | ~1.1 | 1 | 5 | Slightly below-neutral (realistic) |
> Scores follow Gaussian distributions centered at 3.0 with σ ≈ 1.0–1.2, with role-level biases applied via the pipeline's `ROLE_RISK_BIAS` and `ROLE_APPROVAL_BIAS` lookup tables.
### 🤝 Delegation Preference
| Preference | Count | % |
|------------|------:|--:|
| Comfortable delegating (`TRUE`) | ~218 | **54.5%** |
| Prefers direct control (`FALSE`) | ~182 | **45.5%** |
> Generated via `random.random() > 0.45`, so approximately 55% of personas are expected to prefer delegation.
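
To see why the threshold `0.45` gives roughly a 55% share, here is a short sketch that repeats the same comparison many times. The wrapper `sample_delegation` and the seed `0` are assumptions made for this illustration; only the expression `random.random() > 0.45` comes from the card.

```python
import random


def sample_delegation(rng: random.Random) -> bool:
    """True when the draw exceeds 0.45, which happens with probability 0.55."""
    return rng.random() > 0.45


rng = random.Random(0)  # seed chosen for this illustration only
draws = [sample_delegation(rng) for _ in range(100_000)]
share = sum(draws) / len(draws)
print(f"Share preferring delegation: {share:.3f}")
```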
---
## Behavioral Trait Correlations
The dataset encodes **realistic role-level behavioral biases** rather than sampling every trait independently of role:
| Role | Risk Tolerance Bias | Approval Need Bias | Confidence Bias |
|------|:-------------------:|:------------------:|:---------------:|
| Junior | −1 to 0 (cautious) | +1 to +2 (high) | −1 to 0 |
| Mid-level | 0 (neutral) | 0 to +1 | −1 to 0 |
| Senior | 0 to +1 (bold) | −1 to 0 (low) | 0 to +1 |
| Manager | −1 to +1 (spread) | −1 to +1 | 0 to +1 |
| Director | 0 to +1 | −2 to 0 | 0 to +1 |
| Executive | +1 to +2 (high) | −2 to −1 (very low) | 0 to +1 |
This means an **Executive** persona statistically shows high risk tolerance and low approval seeking, which the pipeline treats as consistent with real-world behavioral research, while a **Junior** persona trends toward caution and external validation.
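
One way such a role bias could be combined with a Gaussian score is sketched below. This is an illustration under stated assumptions, not the pipeline's code: the contents of `ROLE_RISK_BIAS` are not shown in this card, so `RISK_BIAS_RANGE` simply copies the ranges from the table above, and `sample_risk_score`, the uniform choice of bias within a range, and `sigma=1.1` are all hypothetical.

```python
import random
from statistics import mean

# Illustrative (low, high) risk-bias ranges copied from the table above.
# The real ROLE_RISK_BIAS table is not shown in this card, so this is a stand-in.
RISK_BIAS_RANGE = {
    "Junior": (-1, 0),
    "Mid-level": (0, 0),
    "Senior": (0, 1),
    "Manager": (-1, 1),
    "Director": (0, 1),
    "Executive": (1, 2),
}


def sample_risk_score(role: str, rng: random.Random, sigma: float = 1.1) -> int:
    """Draw a Gaussian score around 3.0, shift it by a role bias, clamp to 1..5."""
    low, high = RISK_BIAS_RANGE[role]
    bias = rng.randint(low, high)  # inclusive on both ends
    raw = rng.gauss(3.0, sigma) + bias
    return max(1, min(5, round(raw)))


rng = random.Random(42)
for role in ("Junior", "Executive"):
    scores = [sample_risk_score(role, rng) for _ in range(2000)]
    print(role, round(mean(scores), 2))
```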
---
## Usage Examples
### Load with 🤗 Datasets
```python
from datasets import load_dataset

ds = load_dataset("Mostafa190/TwinnyAI-Personas-Dataset")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['persona_id', 'age_range', 'industry', ...],
#         num_rows: 400
#     })
# })
print(ds["train"][0])
```
### Load and explore with pandas
```python
import pandas as pd

# Read the primary CSV straight from the Hub (hf:// paths need huggingface_hub installed).
df = pd.read_csv(
    "hf://datasets/Mostafa190/TwinnyAI-Personas-Dataset/TwinnyAI_final_dataset.csv"
)
print(df.shape)       # (400, 15)
print(df.describe())  # Score statistics

# Filter: Senior Technology personas who trust AI
tech_seniors = df[
    (df["industry"] == "Technology")
    & (df["role_level"] == "Senior")
    & (df["ai_trust_score"] >= 4)
]
print(f"Matched: {len(tech_seniors)} personas")
```
### Load JSONL for fine-tuning
```python
import json

# Parse one JSON object per line, skipping any blank lines.
with open("TwinnyAI_train.jsonl", "r", encoding="utf-8") as f:
    personas = [json.loads(line) for line in f if line.strip()]

p = personas[0]
print(f"#{p['persona_id']}: {p['role_level']} in {p['industry']}")
print(f"Style: {p['communication_style']}")
print(f"Risk: {p['risk_tolerance_score']}/5 | AI Trust: {p['ai_trust_score']}/5")
```
### Generate LLM system prompts
```python
import json
import random

with open("personas.json", encoding="utf-8") as f:
    personas = json.load(f)


def to_system_prompt(p):
    """Render one persona record as a system prompt string."""
    return (
        f"You are a {p['age_range']}-year-old {p['role_level']} "
        f"in the {p['industry']} industry ({p['cultural_context']} background). "
        f"Your communication style is {p['communication_style']}. "
        f"Risk tolerance: {p['risk_tolerance_score']}/5. "
        f"Confidence: {p['confidence_score']}/5. "
        f"You {'delegate when possible' if p['delegation_preference'] else 'prefer direct control'}. "
        f"Respond consistently in character."
    )


persona = random.choice(personas)
print(to_system_prompt(persona))
```
---
## Generation Pipeline
| Script | Role |
|--------|------|
| `01_generate_personas.py` | Samples traits with weighted distributions, applies role-level biases, generates 400 personas with `seed=42` |
| `03_validate_dataset.py` | Schema validation, duplicate removal, score range checks |
| `04_export_jsonl.py` | Exports to JSONL training format |
| `groq_generator.py` | Optional: enriches personas with LLM-generated narrative descriptions via Groq API |
**Fully reproducible:** Running `generate_personas(n=400, seed=42)` produces the exact same 400 personas every time.
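
The reproducibility claim rests on seeding the random number generator before sampling. The sketch below illustrates that idea with weighted sampling. It is not the real `generate_personas`: the name `generate_personas_sketch` is hypothetical, only two fields are sampled, `created_at` is omitted, and the weights are the observed shares from the tables in this card rather than the pipeline's configured priors.

```python
import random

# Illustrative weights only: observed shares from the tables in this card,
# not necessarily the priors configured in the real pipeline.
ROLE_WEIGHTS = {
    "Junior": 20.0, "Mid-level": 29.8, "Senior": 25.2,
    "Manager": 13.8, "Director": 7.0, "Executive": 4.2,
}
CULTURE_WEIGHTS = {
    "North American": 30.2, "Western European": 20.0, "East Asian": 15.0,
    "South Asian": 11.8, "Middle Eastern": 8.0, "Latin American": 8.0,
    "African": 7.0,
}


def generate_personas_sketch(n: int, seed: int) -> list[dict]:
    """Sample n partial personas from a private, seeded random generator."""
    rng = random.Random(seed)  # same seed, same sequence of draws
    roles, role_w = zip(*ROLE_WEIGHTS.items())
    cultures, culture_w = zip(*CULTURE_WEIGHTS.items())
    personas = []
    for i in range(1, n + 1):
        personas.append({
            "persona_id": i,
            "role_level": rng.choices(roles, weights=role_w, k=1)[0],
            "cultural_context": rng.choices(cultures, weights=culture_w, k=1)[0],
        })
    return personas


a = generate_personas_sketch(400, seed=42)
b = generate_personas_sketch(400, seed=42)
print(a == b)  # → True
```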
---
## Intended Use
- **LLM fine-tuning**: Train models to generate or simulate professional personas
- **Behavioral AI twins**: Power digital twin systems with realistic archetypes
- **AI agent seeding**: Consistent behavioral profiles for role-play agents
- **Prompt engineering**: System prompt templates for persona-conditioned generation
- **UX research & simulation**: Diverse synthetic user profiles for product testing
- **Academic research**: Study trait distributions in synthetic professional populations
## Out-of-Scope Use
- Impersonating or deceiving real individuals
- Building surveillance or profiling systems targeting real people
- Generating discriminatory content based on demographic attributes
---
## Ethics & Limitations
- **Bias transparency**: Cultural distributions reflect configurable pipeline priors (e.g., 30% North American). These are adjustable weights, not value judgments.
- **Score design**: Gaussian-sampled scores with σ ≈ 1.0–1.2; extremes (1 or 5) exist but are intentionally less frequent.
- **No PII**: Zero personally identifiable information in any field.
---
## License
Released under the **[MIT License](https://opensource.org/licenses/MIT)**. Free to use, modify, and distribute, provided the copyright and license notice are retained.
---
## Citation
```bibtex
@dataset{twinnyai_personas_2026,
  author       = {Mostafa190},
  title        = {TWINNY.AI Personas Dataset},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Mostafa190/TwinnyAI-Personas-Dataset}},
  license      = {MIT},
  note         = {400 synthetic professional personas, Twinnify pipeline, seed=42}
}
```
---
<div align="center">
<sub>Built with ❤️ as part of the TWINNY.AI project · Powered by the 20AI Pipeline</sub>
</div>
Provider: Mostafa190