teias-ai/synthia
收藏Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/teias-ai/synthia
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: index
dtype: string
- name: persona
dtype: string
- name: followings
list: string
- name: followers
list: string
- name: extractive_persona
dtype: string
- name: recency_persona
dtype: string
splits:
- name: synthia_gemma_3_27b_it
num_bytes: 17799982
num_examples: 2848
- name: synthia_phi_4_mini_instruct
num_bytes: 16676667
num_examples: 2848
- name: anthology_llama_3_8b
num_bytes: 6148856
num_examples: 3000
- name: anthology_gemma_3_27b
num_bytes: 8470793
num_examples: 3000
- name: persona_chat
num_bytes: 638593
num_examples: 3000
download_size: 29176298
dataset_size: 49734891
configs:
- config_name: default
data_files:
- split: synthia_gemma_3_27b_it
path: data/synthia_gemma_3_27b_it-*
- split: synthia_phi_4_mini_instruct
path: data/synthia_phi_4_mini_instruct-*
- split: anthology_llama_3_8b
path: data/anthology_llama_3_8b-*
- split: anthology_gemma_3_27b
path: data/anthology_gemma_3_27b-*
- split: persona_chat
path: data/persona_chat-*
task_categories:
- text-generation
language:
- en
tags:
- Persona
- Social-Media
- Role-Playing
- User-Modeling
- Survey-Simulation
- Agent
- Profiling
- Synthetic-Data
- Human-Simulation
- Opinion-Modeling
pretty_name: SYNTHIA
size_categories:
- 10K<n<100K
---
# Dataset Card for Dataset Name
SYNTHIA (Synthetic Yet Naturally Tailored Human-Inspired PersonA) is a large-scale dataset of grounded personas generated from real-world social media data, paired with an underlying social interaction graph. It is designed to support research in persona-based simulation, computational social science, and LLM-driven population modeling.
### Dataset Description
Large language models are increasingly used to simulate human behavior, opinions, and social dynamics. However, existing persona datasets often suffer from a trade-off between authenticity (grounded in real human data) and scalability (synthetically generated at scale).
To address this, we introduce SYNTHIA, a dataset of high-fidelity personas generated by grounding LLM outputs in real social media posts. Each persona is a first-person narrative constructed from a user’s historical activity, while preserving privacy through extensive anonymization and filtering.
Unlike prior persona datasets, SYNTHIA:
- Grounds personas in real user-generated content rather than purely synthetic sampling
- Preserves the social network structure between users as a graph
- Enables network-aware analysis, including homophily and link prediction
- Is designed for population-level simulation, not individual reconstruction
- **Curated by:** Joint work of [Vahid Rahimzadeh](https://scholar.google.com/citations?user=CTiPTggAAAAJ&hl=en) & [Erfan Moosavi Monazzah](https://scholar.google.com/citations?user=243ygCwAAAAJ&hl=en)
- **Funded by:** [Tehran Institute for Advanced Studies (TeIAS)](https://teias.institute/)
- **Shared by:** [LLMs Lab @ TeIAS](https://teias.ai/)
- **Language(s) (NLP):** English (EN)
### Dataset Sources [optional]
<!-- Provide the basic links for the dataset. -->
- **Repository:** [teias-ai/synthia](https://huggingface.co/datasets/teias-ai/synthia)
- **Paper:** ACL Anthology (Coming Soon!) | [arXiv](https://arxiv.org/abs/2507.14922)
## Intended Uses
This dataset is intended for:
- Population-level simulations and survey response simulation
- Computational social science experiments
- Graph-based learning over social networks
- Studying bias, fairness, and representation in synthetic populations
- Training and evaluating persona-driven LLMs
## Dataset Structure
This dataset has the following splits:
| split name | Description |
|-----------------------------|------------------------------------------------------------------------------------------------------------|
| synthia_gemma_3_27b_it | Original created data, used in **Screening Stage** & **Detailed Analysis** sections of the paper |
| synthia_phi_4_mini_instruct | Original created data with smaller model, used only in **Screening Stage** section of the paper |
| anthology_llama_3_8b | Subset of data from **Anthology** dataset. See their work [here](https://github.com/CannyLab/anthology) |
| anthology_gemma_3_27b | The re-run of **Anthology** pipeline with the same model as **SYNTHIA** to get a fair baseline |
| persona_chat | Subset of data from **persona-chat** dataset. See their work [here](https://huggingface.co/datasets/AlekseyKorshuk/persona-chat) | |
A sample row of the dataset:
```python
{
'index': '02b0de5bd3d81bd1c22bc0ab518bcfca', <str>
'persona': 'My twenties were… a lot...', <str>
'followings': ['5bdcd696b9f20401e12332e75182b3c3', ..., '1361db0f2fea2ed3d3a8feae362aaeb9'], List[<str>]
'followers': ['1361db0f2fea2ed3d3a8feae362aaeb9', ..., '5bdcd696b9f20401e12332e75182b3c3'], List[<str>]
'extractive_persona': 'Since August. She’s 9mo I will 20...', <str>
'recency_persona': 'I don’t even know how to be supportive', <str>
}
```
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@misc{rahimzadeh2026synthiascalablegroundedpersona,
title={Synthia: Scalable Grounded Persona Generation from Social Media Data},
author={Vahid Rahimzadeh and Erfan Moosavi Monazzah and Mohammad Taher Pilehvar and Yadollah Yaghoobzadeh},
year={2026},
eprint={2507.14922},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.14922},
}
```
提供机构:
teias-ai



